Web scraping TimeoutError: Navigation timeout of 30000 ms exceeded - node.js

I'm trying to extract a table from a company website using Puppeteer.
But I don't understand why the browser opens Chromium instead of my default Chrome, which then leads to "TimeoutError: Navigation timeout of 30000 ms exceeded", not leaving me enough time to use a CSS selector. I can't find any documentation about this.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://www....com');
  // search term
  await page.type('#search_term', 'Brazil');
  //await page.screenshot({path: 'sc2.png'});
  //await browser.close();
})();

Puppeteer is Chromium-based by default.
If you wish to use Chrome instead, you have to specify the executable path through the executablePath launch parameter. But to be honest, most of the time there is no point in doing so.
let browser = await puppeteer.launch({
  executablePath: `/path/to/Chrome`,
  //...
});
There is no correlation between "TimeoutError: Navigation timeout of 30000 ms exceeded" and the use of Chromium; it is more likely that your target URL isn't (yet) available.
page.goto will throw an error if:
there's an SSL error (e.g. in case of self-signed certificates).
target URL is invalid.
the timeout is exceeded during navigation.
the remote server does not respond or is unreachable.
the main resource failed to load.
By default, the maximum navigation timeout is 30 seconds. If for some reason your target URL requires more time to load (which seems unlikely), you can pass a timeout: 0 option to disable the timeout entirely.
await page.goto(`https://github.com/`, {timeout: 0});
Note that page.goto will not throw an error when any valid HTTP status code is returned by the remote server, including 404 "Not Found" and 500 "Internal Server Error".
Since Puppeteer does not throw on these, I usually check the HTTP response status code to make sure I'm not silently handling a 404 "Not Found" or other client error response.
const response = await page.goto(`https://github.com/`);
const status = response.status();
if (status !== 404) {
  console.log(`Probably HTTP response status code 200 OK.`);
  //...
}
I'm flying blind here, as I don't have your target URL or more information on what you're trying to accomplish.
You should also give the Puppeteer API documentation a read.
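One more hedged thought: since the snippet types into #search_term immediately after navigation, it may also be worth waiting for the selector explicitly before interacting with it. A minimal sketch, keeping the placeholder URL from the question:

await page.goto('https://www....com', { waitUntil: 'networkidle2' });
// Wait until the element actually exists before typing into it,
// so a slow page doesn't race the CSS selector lookup.
await page.waitForSelector('#search_term');
await page.type('#search_term', 'Brazil');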

The approach below works for me. Try adding the following one-liner to your code.
The setDefaultNavigationTimeout method lets you define the navigation timeout of the tab and expects the value in milliseconds as its first argument. Here a value of 0 means an unlimited amount of time, since I know my page will load up someday.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // Add the below 1 line of code
  await page.setDefaultNavigationTimeout(0);
  // follows the rest of your code block
})();
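If an unlimited timeout feels risky (a page that never finishes loading would block the script forever), a hedged alternative is a large but finite value; the 120000 ms below is an arbitrary assumption:

// Cap navigations at 2 minutes instead of waiting forever.
await page.setDefaultNavigationTimeout(120000);
// Or override the timeout for a single navigation:
await page.goto('https://www....com', { timeout: 120000 });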

Related

How to pass request headers to Percy snapshot?

I'm trying to use Percy.io to snapshot pages on a site that requires a custom header in the request (any request missing this header receives an HTTP 403).
I'm looking at the docs here:
https://docs.percy.io/docs/cli-configuration#snapshot
I'm confused. The discovery section includes a request-headers option:
request-headers: An object containing HTTP headers to be sent for each request made during asset discovery.
But that seems to relate only to asset discovery - fetching CSS, JS and other page assets required by the URL I'm trying to snapshot.
I want to send a custom HTTP header with the original request; the one for the URL I want a snapshot of. And I don't see any option for it. But... it must be possible, no?
Here's what I'm doing:
const puppeteer = require('puppeteer');
const percySnapshot = require('@percy/puppeteer');

const pageReferences = [
  'https://www.my-url.com',
];

const options = {
  requestHeaders: {
    "proxy-authorization": "************"
  }
};

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox']
  });
  let page = await browser.newPage();
  await page.goto(pageReferences[0]);
  await percySnapshot(page, pageReferences[0], options);
})();
That gives me a snapshot of a 403 error page. But I can otherwise reach the page fine with the correct header:
$ curl -I -H "proxy-authorization:***********" https://www.my-url.com
HTTP/2 200
What am I missing here?
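One hedged thing worth trying: Puppeteer's page.setExtraHTTPHeaders() applies the given headers to every request the page makes, including the main document request, so setting the header before page.goto() may get the navigation itself past the 403. A minimal sketch reusing the names from the question:

let page = await browser.newPage();
// Send the custom header with the page's own navigation request,
// not just with Percy's asset discovery.
await page.setExtraHTTPHeaders({ 'proxy-authorization': '************' });
await page.goto(pageReferences[0]);
await percySnapshot(page, pageReferences[0], options);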

How to get lastModified property of another website

When I use the inspect/developer tools in Chrome I can find the last-modified date in the browser, but I want to get the same date in my Node.js application.
I have already tried
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
// evaluate() must be awaited; return the value instead of logging
// inside the page context, where console.log is not visible to Node.
const lastModified = await page.evaluate(() => document.lastModified);
console.log(lastModified);
Unfortunately this code shows the current time of the new DOM creation, since we are using newPage(). Can anyone help me?
I have also tried JSDOM.
Thanks in advance.
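A hedged sketch of an alternative: instead of document.lastModified (which falls back to the current time when the server sends no modification date), read the Last-Modified header from the navigation response itself. This assumes the server actually sends that header:

const response = await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
// headers() returns an object keyed by lower-cased header names;
// this is undefined if the server omits Last-Modified.
const lastModified = response.headers()['last-modified'];
console.log(lastModified);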

How to kill old Puppeteer browser if still running?

I am trying to scrape data from different websites using only one Puppeteer instance. I don't want to launch a new browser for each website, so I need to check whether a browser has already been launched and, if so, just open a new tab. I did something like the below; these are the conditions I always check before launching a browser:
const browser = await puppeteer.launch();
browser?.isConnected();
browser.process(); // the spawned ChildProcess; null if the browser was connected to rather than launched
But still, I found that sometimes my script re-launches the browser even though an old browser is already running. So I am thinking of killing any old browser that has been launched, or is there a better check? Any other good suggestions will be highly appreciated.
I'm not sure that specific operation (closing existing browsers) is available in Puppeteer's API, but what I can recommend is how people usually handle this situation: make sure the browser instance is closed if any issue is encountered:
let browser = null;
try {
  browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox'],
  });
  const page = await browser.newPage();
  const url = req.query.url; // assumes an Express-style request handler
  await page.goto(url);
  const bodyHTML = await page.evaluate(() => document.body.innerHTML);
  res.send(bodyHTML);
} catch (e) {
  console.log(e);
} finally {
  if (browser)
    await browser.close();
}
Otherwise, you can use shell-based commands like kill or pkill if you have access to the process ID of the previous browser.
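For a shell-free variant from inside Node, a minimal sketch (assuming the browser was launched by this process rather than connected to remotely):

// browser.process() returns the spawned Chromium ChildProcess,
// or null when the browser was attached via puppeteer.connect().
const proc = browser.process();
if (proc && proc.pid) {
  proc.kill('SIGKILL'); // force-kill the old browser process
}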
The most reliable means of closing a Puppeteer instance that I've found is to close all of the pages within a BrowserContext, which automatically closes the browser. I've seen instances of Chromium linger in Task Manager after calling just await browser.close().
Here is how I do this:
const openAndCloseBrowser = async () => {
  const browser = await puppeteer.launch();
  try {
    // your logic
  } catch (error) {
    // error handling
  } finally {
    const pages = await browser.pages();
    for (const page of pages) await page.close();
  }
};
If you try running await browser.close() after running the loop and closing each page individually, you should see an error stating that the browser was already closed, and your Task Manager should not show lingering Chromium instances.
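If the underlying goal is to reuse one browser instead of launching a new one each time, a hedged sketch is to keep the DevTools websocket endpoint around and reconnect with puppeteer.connect(); the connect call failing doubles as the "is the old browser still alive?" check:

const puppeteer = require('puppeteer');

(async () => {
  // First run: launch once and store the endpoint somewhere durable.
  const browser = await puppeteer.launch();
  const wsEndpoint = browser.wsEndpoint();

  // Later: reconnect instead of re-launching. connect() rejects if
  // the old browser is gone, so fall back to a fresh launch then.
  try {
    const existing = await puppeteer.connect({ browserWSEndpoint: wsEndpoint });
    const page = await existing.newPage();
    // ... scrape ...
  } catch (e) {
    console.log('Old browser is gone; launch a new one.');
  }
})();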

Simple way to add Firefox Extensions/Add Ons

I know with Pyppeteer (Puppeteer) or Selenium, I can simply add chrome/chromium extensions by including them in args like this:
args=[
    f'--disable-extensions-except={pathToExtension}',
    f'--load-extension={pathToExtension}'
]
I also know Selenium has the very useful load_extension function.
I was wondering if there was a similarly easy way to load extensions/add-ons in Firefox for Playwright? Or perhaps just with the firefox_user_args option?
I've seen an example in JS using this:
const path = require('path');
const {firefox} = require('playwright');
const webExt = require('web-ext').default;

(async () => {
  // 1. Enable verbose logging and start capturing logs.
  webExt.util.logger.consoleStream.makeVerbose();
  webExt.util.logger.consoleStream.startCapturing();

  // 2. Launch firefox
  const runner = await webExt.cmd.run({
    sourceDir: path.join(__dirname, 'webextension'),
    firefox: firefox.executablePath(),
    args: [`-juggler=1234`],
  }, {
    shouldExitProgram: false,
  });

  // 3. Parse firefox logs and extract juggler endpoint.
  const JUGGLER_MESSAGE = `Juggler listening on`;
  const message = webExt.util.logger.consoleStream.capturedMessages.find(msg => msg.includes(JUGGLER_MESSAGE));
  const wsEndpoint = message.split(JUGGLER_MESSAGE).pop();

  // 4. Connect playwright and start driving browser.
  const browser = await firefox.connect({ wsEndpoint });
  const page = await browser.newPage();
  await page.goto('https://mozilla.org');
  // .... go on driving ....
})();
Is there anything similar for Python?
TL;DR: code at the end.
After sinking too much time into this, I have found a way to install extensions in Firefox under Playwright, a feature that I believe is not supported for now, even though Chromium has it and it works.
Since adding an extension in Firefox requires the user to click a special popup that appears when you try to install it, I figured it was easier to download the .xpi file and then install the extension from that file.
To install a file as an extension, we need to go to the URL 'about:debugging#/runtime/this-firefox' to install a temporary extension.
But on that page you cannot use the console or the DOM, due to protection that Firefox has and that I haven't been able to get around.
However, we know that about:debugging runs under a special tab id, so we can open a new tab at 'about:devtools-toolbox', where we can fake user input to run commands in a GUI console.
The trick is to load the file as an 'nsIFile'. To do that we make use of the packages already loaded in 'about:debugging' and require the ones we need.
The following code is Python, but I guess translating it into JavaScript should be no big deal:
# get the absolute path for all the xpi extensions
extensions = [os.path.abspath(f"Settings/Addons/{file}")
              for file in os.listdir("Settings/Addons")
              if file.endswith(".xpi")]
if not extensions:
    return

c1 = "const { AddonManager } = require('resource://gre/modules/AddonManager.jsm');"
c2 = "const { FileUtils } = require('resource://gre/modules/FileUtils.jsm');"
c3 = "AddonManager.installTemporaryAddon(new FileUtils.File('{}'));"

context = await browser.new_context()
page = await context.new_page()
page2 = await context.new_page()
await page.goto("about:debugging#/runtime/this-firefox", wait_until="domcontentloaded")
await page2.goto("about:devtools-toolbox?id=9&type=tab", wait_until="domcontentloaded")
await asyncio.sleep(1)

# tab into the toolbox, switch to the console panel and focus its input
await page2.keyboard.press("Tab")
await page2.keyboard.down("Shift")
await page2.keyboard.press("Tab")
await page2.keyboard.press("Tab")
await page2.keyboard.up("Shift")
await page2.keyboard.press("ArrowRight")
await page2.keyboard.press("Enter")

# load AddonManager and FileUtils in the GUI console
await page2.keyboard.type(f"{' '*10}{c1}{c2}")
await page2.keyboard.press("Enter")

for extension in extensions:
    print(f"Adding extension: {extension}")
    await asyncio.sleep(0.2)
    await page2.keyboard.type(f"{' '*10}{c3.format(extension)}")
    await page2.keyboard.press("Enter")
    # await asyncio.sleep(0.2)

await page2.bring_to_front()
Note that there are some sleeps because the page needs time to load and Playwright cannot detect it.
I needed to add some whitespace because, for some reason, Playwright or Firefox was dropping some of the first characters of the commands.
Also, if you want to install more than one add-on, I suggest you tune the amount of sleep before bringing the page to the front, in case an add-on opens a new tab.

Modify HTTP responses from a Chrome extension

Is it possible to create a Chrome extension that modifies HTTP response bodies?
I have looked in the Chrome Extension APIs, but I haven't found anything to do this.
In general, you cannot change the response body of an HTTP request using the standard Chrome extension APIs.
This feature is being requested at 104058: WebRequest API: allow extension to edit response body. Star the issue to get notified of updates.
If you want to edit the response body for a known XMLHttpRequest, inject code via a content script to override the default XMLHttpRequest constructor with a custom (full-featured) one that rewrites the response before triggering the real event. Make sure that your XMLHttpRequest object is fully compliant with Chrome's built-in XMLHttpRequest object, or AJAX-heavy sites will break.
In other cases, you can use the chrome.webRequest or chrome.declarativeWebRequest APIs to redirect the request to a data:-URI. Unlike the XHR-approach, you won't get the original contents of the request. Actually, the request will never hit the server because redirection can only be done before the actual request is sent. And if you redirect a main_frame request, the user will see the data:-URI instead of the requested URL.
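A minimal sketch of the data:-URI redirect (a Manifest V2 background script; assumes "webRequest", "webRequestBlocking" and matching host permissions in the manifest, and an example.com URL pattern as a placeholder):

chrome.webRequest.onBeforeRequest.addListener(
  (details) => {
    // The request never reaches the server; the browser is answered
    // locally with the contents of this data: URI instead.
    return { redirectUrl: 'data:text/plain,Intercepted!' };
  },
  { urls: ['https://example.com/*'] },
  ['blocking']
);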
I just released a DevTools extension that does just that :)
It's called tamper; it's based on mitmproxy and it allows you to see all requests made by the current tab, modify them, and serve the modified version the next time you refresh.
It's a pretty early version, but it should be compatible with OS X and Windows. Let me know if it doesn't work for you.
You can get it here: http://dutzi.github.io/tamper/
How this works
As @Xan commented below, the extension communicates through Native Messaging with a Python script that extends mitmproxy.
The extension lists all requests using chrome.devtools.network.onRequestFinished.
When you click one of the requests, it downloads its response using the request object's getContent() method, and then sends that response to the Python script, which saves it locally.
It then opens the file in an editor (using call for OS X or subprocess.Popen for Windows).
The Python script uses mitmproxy to listen to all communication made through that proxy; if it detects a request for a file that was saved, it serves the saved file instead.
I used Chrome's proxy API (specifically chrome.proxy.settings.set()) to set a PAC file as the proxy setting. That PAC file redirects all communication to the Python script's proxy.
One of the greatest things about mitmproxy is that it can also modify HTTPS communication. So you have that as well :)
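For reference, a minimal sketch of the PAC-based proxy setup described above; the 127.0.0.1:8080 address is an assumption and should point at wherever the mitmproxy instance actually listens:

chrome.proxy.settings.set({
  value: {
    mode: 'pac_script',
    pacScript: {
      // Route every request through the local mitmproxy instance.
      data: "function FindProxyForURL(url, host) { return 'PROXY 127.0.0.1:8080'; }"
    }
  },
  scope: 'regular'
}, () => {});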
Like @Rob W said, I've overridden XMLHttpRequest, and here is the result: a modification of any XHR request on any site (works like a transparent modification proxy):
var _open = XMLHttpRequest.prototype.open;
window.XMLHttpRequest.prototype.open = function (method, URL) {
  var _onreadystatechange = this.onreadystatechange,
      _this = this;
  _this.onreadystatechange = function () {
    // catch only completed 'api/search/universal' requests
    if (_this.readyState === 4 && _this.status === 200 && ~URL.indexOf('api/search/universal')) {
      try {
        //////////////////////////////////////
        // THIS IS ACTIONS FOR YOUR REQUEST //
        // EXAMPLE:                         //
        //////////////////////////////////////
        var data = JSON.parse(_this.responseText); // {"fields": ["a","b"]}
        if (data.fields) {
          data.fields.push('c', 'd');
        }
        // rewrite responseText
        Object.defineProperty(_this, 'responseText', {value: JSON.stringify(data)});
        /////////////// END //////////////////
      } catch (e) {}
      console.log('Caught! :)', method, URL/*, _this.responseText*/);
    }
    // call original callback
    if (_onreadystatechange) _onreadystatechange.apply(this, arguments);
  };
  // detect any onreadystatechange changing
  Object.defineProperty(this, "onreadystatechange", {
    get: function () {
      return _onreadystatechange;
    },
    set: function (value) {
      _onreadystatechange = value;
    }
  });
  return _open.apply(_this, arguments);
};
For example, this code can be used successfully in Tampermonkey to make modifications on any site :)
Yes. It is possible with the chrome.debugger API, which grants extensions access to the Chrome DevTools Protocol, which supports HTTP interception and modification through its Network API.
This solution was suggested by a comment on Chrome Issue 487422:
For anyone wanting an alternative which is doable at the moment, you can use chrome.debugger in a background/event page to attach to the specific tab you want to listen to (or attach to all tabs if that's possible, haven't tested all tabs personally), then use the network API of the debugging protocol.
The only problem with this is that there will be the usual yellow bar at the top of the tab's viewport, unless the user turns it off in chrome://flags.
First, attach a debugger to the target:
chrome.debugger.getTargets((targets) => {
  let target = /* Find the target. */;
  let debuggee = { targetId: target.id };
  chrome.debugger.attach(debuggee, "1.2", () => {
    // TODO
  });
});
Next, send the Network.setRequestInterceptionEnabled command, which will enable interception of network requests:
chrome.debugger.getTargets((targets) => {
  let target = /* Find the target. */;
  let debuggee = { targetId: target.id };
  chrome.debugger.attach(debuggee, "1.2", () => {
    chrome.debugger.sendCommand(debuggee, "Network.setRequestInterceptionEnabled", { enabled: true });
  });
});
Chrome will now begin sending Network.requestIntercepted events. Add a listener for them:
chrome.debugger.getTargets((targets) => {
  let target = /* Find the target. */;
  let debuggee = { targetId: target.id };
  chrome.debugger.attach(debuggee, "1.2", () => {
    chrome.debugger.sendCommand(debuggee, "Network.setRequestInterceptionEnabled", { enabled: true });
  });
  chrome.debugger.onEvent.addListener((source, method, params) => {
    if (source.targetId === target.id && method === "Network.requestIntercepted") {
      // TODO
    }
  });
});
In the listener, params.request will be the corresponding Request object.
Send the response with Network.continueInterceptedRequest:
Pass a base64 encoding of your desired raw HTTP response (including the HTTP status line, headers, etc.!) as rawResponse.
Pass params.interceptionId as interceptionId.
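An untested sketch of those two steps inside the Network.requestIntercepted listener (the response body here is obviously a placeholder):

// Build a complete raw HTTP response: status line, headers, blank line
// and body, then base64-encode it for rawResponse.
const raw =
  'HTTP/1.1 200 OK\r\n' +
  'Content-Type: text/html\r\n' +
  '\r\n' +
  '<html><body>Intercepted!</body></html>';
chrome.debugger.sendCommand(debuggee, 'Network.continueInterceptedRequest', {
  interceptionId: params.interceptionId,
  rawResponse: btoa(raw),
});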
Note that I have not tested any of this, at all.
While Safari has this feature built in, the best workaround I've found for Chrome so far is to use Cypress's intercept functionality. It cleanly allows me to stub HTTP responses in Chrome. I call cy.intercept, then cy.visit(<URL>), and it intercepts and provides a stubbed response for a specific request the visited page makes. Here's an example:
cy.intercept('GET', '/myapiendpoint', {
  statusCode: 200,
  body: {
    myexamplefield: 'Example value',
  },
})
cy.visit('http://localhost:8080/mytestpage')
Note: You may also need to configure Cypress to disable some Chrome-specific security settings.
The original question was about Chrome extensions, but I notice that it has branched out into different methods, going by the upvotes on answers that have non-Chrome-extension methods.
Here's a way to kind of achieve this with Puppeteer. Note the caveat mentioned on the originalContent line - the fetched response may be different to the original response in some circumstances.
With Node.js:
npm install puppeteer node-fetch@2.6.7
Create this main.js:
const puppeteer = require("puppeteer");
const fetch = require("node-fetch");

(async function() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', async (request) => {
    let url = request.url().replace(/\/$/g, ""); // remove trailing slash from urls
    console.log("REQUEST:", url);
    let originalContent = await fetch(url).then(r => r.text()); // TODO: Pass request headers here for more accurate response (still not perfect, but more likely to be the same as the "actual" response)
    if(url === "https://example.com") {
      request.respond({
        status: 200,
        contentType: 'text/html; charset=utf-8', // For JS files: 'application/javascript; charset=utf-8'
        body: originalContent.replace(/example/gi, "TESTING123"),
      });
    } else {
      request.continue();
    }
  });
  await page.goto("https://example.com");
})();
Run it:
node main.js
With Deno:
Install Deno:
curl -fsSL https://deno.land/install.sh | sh # linux, mac
irm https://deno.land/install.ps1 | iex # windows powershell
Download Chrome for Puppeteer:
PUPPETEER_PRODUCT=chrome deno run -A --unstable https://deno.land/x/puppeteer@16.2.0/install.ts
Create this main.js:
import puppeteer from "https://deno.land/x/puppeteer@16.2.0/mod.ts";

const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', async (request) => {
  let url = request.url().replace(/\/$/g, ""); // remove trailing slash from urls
  console.log("REQUEST:", url);
  let originalContent = await fetch(url).then(r => r.text()); // TODO: Pass request headers here for more accurate response (still not perfect, but more likely to be the same as the "actual" response)
  if(url === "https://example.com") {
    request.respond({
      status: 200,
      contentType: 'text/html; charset=utf-8', // For JS files: 'application/javascript; charset=utf-8'
      body: originalContent.replace(/example/gi, "TESTING123"),
    });
  } else {
    request.continue();
  }
});
await page.goto("https://example.com");
Run it:
deno run -A --unstable main.js
(I'm currently running into a TimeoutError with this that will hopefully be resolved soon: https://github.com/lucacasonato/deno-puppeteer/issues/65)
Yes, you can modify HTTP responses in a Chrome extension. I built ModResponse (https://modheader.com/modresponse) to do exactly that. It can record and replay your HTTP responses, modify them, add delay, and even use the HTTP response from a different server (like from your localhost).
The way it works is to use the chrome.debugger API (https://developer.chrome.com/docs/extensions/reference/debugger/), which gives you access to the Chrome DevTools Protocol (https://chromedevtools.github.io/devtools-protocol/). You can then intercept the request and response using the Fetch domain API (https://chromedevtools.github.io/devtools-protocol/tot/Fetch/), then override the response however you want. (You can also use the Network domain, though it is deprecated in favor of the Fetch domain.)
The nice thing about this approach is that it just works out of the box. No desktop app installation required. No extra proxy setup. However, it will show a debugging banner in Chrome (which you can hide by adding an argument to Chrome), and it is significantly more complicated to set up than the other APIs.
For examples of how to use the debugger API, take a look at the chrome-extensions-samples: https://github.com/GoogleChrome/chrome-extensions-samples/tree/main/mv2-archive/api/debugger/live-headers
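A hedged sketch of the Fetch domain flow (assumes a debuggee is already attached, as in the chrome.debugger examples earlier in this thread; the override body is a placeholder):

// Pause responses for all URLs.
chrome.debugger.sendCommand(debuggee, 'Fetch.enable', {
  patterns: [{ urlPattern: '*', requestStage: 'Response' }],
});
chrome.debugger.onEvent.addListener((source, method, params) => {
  if (method === 'Fetch.requestPaused') {
    // Answer the paused request with our own response instead.
    chrome.debugger.sendCommand(source, 'Fetch.fulfillRequest', {
      requestId: params.requestId,
      responseCode: 200,
      responseHeaders: [{ name: 'Content-Type', value: 'text/html' }],
      body: btoa('<html><body>Overridden</body></html>'), // base64-encoded
    });
  }
});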
I've just found this extension; it does a lot of other things, but modifying API responses in the browser works really well: https://requestly.io/
Follow these steps to get it working:
Install the extension
Go to HttpRules
Add a new rule with a URL and a response
Enable the rule with the radio button
Go to Chrome and you should see the modified response
You can have multiple rules with different responses and enable/disable them as required. Unfortunately, I've not found out how to serve a different response per request when the URL is the same.