how to pass request headers to Percy snapshot? - node.js

I'm trying to use Percy.io to snapshot pages on a site that requires a custom header in the request (any requests missing this header receive a HTTP 403).
I'm looking at the docs here:
https://docs.percy.io/docs/cli-configuration#snapshot
I'm confused. The discovery section includes a request-headers option:
request-headers: An object containing HTTP headers to be sent for each request made during asset discovery.
But that seems to relate only to asset discovery - fetching CSS, JS and other page assets required by the URL I'm trying to snapshot.
I want to send a custom HTTP header with the original request; the one for the URL I want a snapshot of. And I don't see any option for it. But... it must be possible, no?
Here's what I'm doing:
const puppeteer = require('puppeteer');
const percySnapshot = require('@percy/puppeteer');

const pageReferences = [
  'https://www.my-url.com',
];

const options = {
  requestHeaders: {
    "proxy-authorization": "************"
  }
};

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox']
  });
  let page = await browser.newPage();
  await page.goto(pageReferences[0]);
  await percySnapshot(page, pageReferences[0], options);
})();
That gives me a snapshot of a 403 error page. But I can otherwise reach the page fine with the correct header:
$ curl -I -H "proxy-authorization:***********" https://www.my-url.com
HTTP/2 200
What am I missing here?
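One thing I'm considering is setting the header on the Puppeteer page itself before navigating (via page.setExtraHTTPHeaders) and passing the discovery headers separately to percySnapshot, roughly like the sketch below, but I'm not sure the options object actually has that shape:

// Untested idea: set the header for the initial navigation on the Puppeteer
// page, and pass it separately for Percy's asset discovery requests.
await page.setExtraHTTPHeaders({
  'proxy-authorization': '************'
});
await page.goto(pageReferences[0]);
await percySnapshot(page, pageReferences[0], {
  discovery: {
    requestHeaders: {
      'proxy-authorization': '************'
    }
  }
});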

Related

Refreshing PubSubHubbub subscription for Youtube API in NodeJS

I managed to subscribe to a YouTube feed through https://pubsubhubbub.appspot.com/subscribe in a web browser. I've been trying to refresh my subscription to PubSubHubbub through Fetch in Node.js; the code I used is below.
const details = {
  'hub.mode': 'subscribe',
  'hub.topic': `https://www.youtube.com/xml/feeds/videos.xml?channel_id=${process.env.ID}`,
  'hub.callback': process.env.callback,
  'hub.secret': process.env.secret
};
const endpoint = 'http://pubsubhubbub.appspot.com/subscribe';
var formBody = [];
for (const property in details) {
  const encodedKey = encodeURIComponent(property);
  const encodedValue = encodeURIComponent(details[property]);
  formBody.push(encodedKey + "=" + encodedValue);
}
formBody = formBody.join("&");
const result = await fetch(endpoint, { method: "POST", body: formBody, 'Content-Type': 'application/x-www-form-urlencoded'});
console.log(result.status, await result.text());
Expected:
Status 204
Actual:
Status 400, content: "Invalid value for hub.mode:"
I expected the content to at least tell me what the invalid value was, but it ended up being blank; it appears the hub did not manage to read the POSTed content at all. I'm looking for ways to improve my code so that I don't encounter this problem.
I realised that I was supposed to specify charset=UTF-8 in the Content-Type header. I'm not particularly well-versed in sending requests yet, so this is definitely an eye-opener for me. After adding it, the server responds with 204 as expected, so I'll close my question.
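A minimal sketch of the corrected request might look like this, with the Content-Type (including the charset) passed inside headers; in the original snippet it sat at the top level of the fetch options, where it is ignored. The endpoint and formBody are the same as above:

const result = await fetch(endpoint, {
  method: "POST",
  body: formBody,
  headers: {
    // Content-Type belongs inside `headers`, with the charset appended.
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
  }
});
console.log(result.status, await result.text()); // expecting 204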
A workaround I tried while fixing this was to import the exec() function from the child_process module and run cURL to post the request instead. However, I would recommend sticking to one language to keep things readable; exec() runs commands directly in the shell, which may have undesirable effects.

Webscraping TimeoutError: Navigation timeout of 30000 ms exceeded

I'm trying to extract a table from a company website using Puppeteer.
But I don't understand why the browser opens Chromium instead of my default Chrome, which then leads to "TimeoutError: Navigation timeout of 30000 ms exceeded" and doesn't leave me enough time to use a CSS selector. I don't see any documentation about this.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://www....com');
  // search term
  await page.type("#search_term", "Brazil");
  //await page.screenshot({path: 'sc2.png'});
  //await browser.close();
})();
Puppeteer is Chromium-based by default.
If you wish to use Chrome instead, you have to specify the executable path through the executablePath launch parameter. But to be honest, most of the time there is no point in doing so.
let browser = await puppeteer.launch({
  executablePath: `/path/to/Chrome`,
  //...
});
There is no correlation between "TimeoutError: Navigation timeout of 30000 ms exceeded" and the use of Chromium; it is more likely that your target URL isn't (yet) available.
page.goto will throw an error if:
there's an SSL error (e.g. in case of self-signed certificates).
target URL is invalid.
the timeout is exceeded during navigation.
the remote server does not respond or is unreachable.
the main resource failed to load.
By default, the maximum navigation timeout is 30 seconds. If for some reason your target URL requires more time to load (which seems unlikely), you can specify a timeout: 0 option.
await page.goto(`https://github.com/`, {timeout: 0});
Note that Puppeteer will not throw an error just because an HTTP error status code is returned:
page.goto will not throw an error when any valid HTTP status code is returned by the remote server, including 404 "Not Found" and 500 "Internal Server Error".
I usually check the HTTP response status code to make sure I'm not getting a 404 "Not Found" or similar client error response.
let status = await page.goto(`https://github.com/`);
status = status.status();
if (status != 404) {
  console.log(`Probably HTTP response status code 200 OK.`);
  //...
}
I'm flying blind here as I don't have your target url nor more information on what you're trying to accomplish.
You should also give the puppeteer api documentation a read.
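On the original question's other symptom (not having enough time to use the CSS selector), waiting explicitly for the element is usually more robust than raising the navigation timeout. A minimal sketch, assuming the #search_term field from the question really exists on the page:

// Wait for the search box to be present before typing into it.
await page.waitForSelector("#search_term", { timeout: 60000 });
await page.type("#search_term", "Brazil");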
The approach below works for me. Try adding the following one-liner to your code.
The setDefaultNavigationTimeout method lets you define the default navigation timeout for the tab and expects the value in milliseconds as its first argument. A value of 0 means an unlimited amount of time, since I know my page will load up someday.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  // Add the below 1 line of code
  await page.setDefaultNavigationTimeout(0);
  // follows the rest of your code block
})();

I'm requesting html content from a site with axios in JS but the website is blocking my request

I want my script to pull the html data from a site, but it is returning a page that says it knows my script is a bot and giving it an 'I am not a robot' test to pass.
Instead of returning the content of the site it returns a page that partly reads...
"As you were browsing, something about your browser\n made us think you were a bot."
My code is...
const axios = require('axios');
const url = "https://www.bhgre.com/Better-Homes-and-Gardens-Real-Estate-Blu-Realty-49231c/Brady-Johnson-7469865a";

axios(url, {headers: {
  'Mozilla': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.3 Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/43.4.0',
}})
  .then(response => {
    const html = response.data;
    console.log(html);
  })
  .catch(console.error);
I've tried a few different headers, but there's no fooling the site into thinking my script is human. This is in NodeJS.
This may or may not have a bearing on my issue, but this code will hopefully live on the backend of the React site I'm building. I'm not trying to scrape the site as a one-off: I would like my site to read a little bit of content from this site, instead of having to manually update my site with that content whenever it changes.
Not every site can be accessed using axios or curl. There are various kinds of checks, including CORS, that can prevent someone from accessing a site directly via a client other than a browser.
You can achieve the same using phantom (https://www.npmjs.com/package/phantom). This is commonly used by scrapers and if you're afraid that the other site may block you for repeated access, you can use a random interval before making requests. If you need to read something from the returned HTML page, you can use cheerio (https://www.npmjs.com/package/cheerio).
Hope it helps.
Below is the code that I tried and worked for your URL:
const phantom = require('phantom');

(async () => {
  const url = "https://www.bhgre.com/Better-Homes-and-Gardens-Real-Estate-Blu-Realty-49231c/Brady-Johnson-7469865a";
  const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
  const page = await instance.createPage();
  const status = await page.open(url);
  if (status !== 'success') {
    console.error(status);
    await instance.exit();
    return;
  }
  const content = await page.property('content');
  await instance.exit();
  console.log(content);
})();
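And if you only need a specific piece of the returned HTML, cheerio (mentioned above) can parse it. A small sketch, where .agent-name is just a placeholder selector for illustration:

const cheerio = require('cheerio');

// `content` is the HTML string obtained from phantom above.
const $ = cheerio.load(content);
// '.agent-name' is a hypothetical selector; use whatever matches the data you need.
console.log($('.agent-name').first().text().trim());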

How to pass two parameters inside a google cloud puppeteer function? Resolve to req to JSON

I'm building an application that takes two parameters: the request URL and a CSS query selector. I'm having a hard time getting the request to work when it looks like this:
"http://localhost:5000/scrapeme/us-central1/scraperSelector?requestURL=https://www.google.com&selector=#hplogo". The request is not picking up the selector variable, which comes back as not defined.
I'm not really sure what I'm doing wrong. I've tried different approaches, such as using request.body or creating an object and passing it down in the code. I've read the Google documentation and couldn't really find a good example of passing multiple parameters to a cloud function.
const admin = require('firebase-admin');
const functions = require('firebase-functions');
const puppeteer = require("puppeteer");
const chalk = require("chalk");

admin.initializeApp();

// for yellow console logging
const checking = chalk.bold.yellow;

// const uri = "http://localhost:5000/scrapeme/us-central1/scraperSelector";
// const appURL = "scrapeme.firebaseapp.com";

exports.scraperSelector = functions.runWith({ memory: '1GB' }).https.onRequest(async(request, response) => {
  // initialize variables from request params
  const requestURL = request.query.requestURL;
  console.log("Evaluating " + requestURL);
  let selector = request.query.selector;
  console.log("Evaluating " + selector);
  console.log("Evaluating " + request.originalUrl);
  // Launch a browser
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  // Visit the page and get content
  const page = await browser.newPage();
  // Go to requested URL
  await page.goto(requestURL, { waitUntil: 'networkidle0' });
  console.log(checking("Evaluating " + requestURL));
  // find the css selector
  const content = await page.evaluate(() => {
    console.log(JSON.stringify(selector));
    let selectorCSS = document.querySelector(selector).innerText;
    console.log(selectorCSS);
    return selectorCSS;
  },);
  // Send the response
  response.json(content);
});
// Example URL of how request should look
// http://localhost:5000/scrapeme/us-central1/scraperSelector?requestURL=https://www.google.com&selector=#hplogo
I expect the output to resolve to a JSON response. I'm trying to grab a single item from a page.
{
  "result": "$18.41"
}
However, I'm getting this output and error:
Evaluating https://www.google.com
Evaluating
Evaluating /scrapeme/us-central1/scraperSelector?requestURL=https://www.google.com&selector=
Evaluating https://www.google.com
! functions: Error: Evaluation failed: ReferenceError: selector is not defined
at puppeteer_evaluation_script:2:36
You have to pass the selector variable to the evaluate function.
//...
let selector = request.query.selector;
//...
const content = await page.evaluate(selector => { // <-- add the `selector` variable.
  console.log(JSON.stringify(selector));
  let selectorCSS = document.querySelector(selector).innerText;
  console.log(selectorCSS);
  return selectorCSS;
}, selector); // <-- add the `selector` variable
Read more in the docs.
The issue is that # is a special character in URLs. It signals to the web browser a string called a "fragment" that targets an anchor on a web page.
If you want to pass a parameter to a function via the query string of the URL, it should be URL escaped with percent encoding. So, with that encoding, your URL parameter would be selector=%23hplogo.
Typically, you use a library to encode all the parameters you pass, so that they're all valid no matter what string they contain.
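For instance, URLSearchParams (or encodeURIComponent) takes care of this. A quick sketch of building the example URL from the question:

const base = "http://localhost:5000/scrapeme/us-central1/scraperSelector";
const params = new URLSearchParams({
  requestURL: "https://www.google.com",
  selector: "#hplogo"  // will be encoded as %23hplogo
});
console.log(`${base}?${params.toString()}`);
// -> ...?requestURL=https%3A%2F%2Fwww.google.com&selector=%23hplogo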

Modify HTTP responses from a Chrome extension

Is it possible to create a Chrome extension that modifies HTTP response bodies?
I have looked in the Chrome Extension APIs, but I haven't found anything to do this.
In general, you cannot change the response body of a HTTP request using the standard Chrome extension APIs.
This feature is being requested at 104058: WebRequest API: allow extension to edit response body. Star the issue to get notified of updates.
If you want to edit the response body for a known XMLHttpRequest, inject code via a content script to override the default XMLHttpRequest constructor with a custom (full-featured) one that rewrites the response before triggering the real event. Make sure that your XMLHttpRequest object is fully compliant with Chrome's built-in XMLHttpRequest object, or AJAX-heavy sites will break.
In other cases, you can use the chrome.webRequest or chrome.declarativeWebRequest APIs to redirect the request to a data:-URI. Unlike the XHR-approach, you won't get the original contents of the request. Actually, the request will never hit the server because redirection can only be done before the actual request is sent. And if you redirect a main_frame request, the user will see the data:-URI instead of the requested URL.
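For the redirect approach, a minimal sketch using the blocking chrome.webRequest API; this assumes a manifest v2 extension with webRequest, webRequestBlocking and matching host permissions, and the URL pattern and replacement body are placeholders:

// Background script (manifest v2): replace the body of a specific URL
// by redirecting the request to a data: URI before it hits the network.
chrome.webRequest.onBeforeRequest.addListener(
  function(details) {
    return {
      redirectUrl: 'data:text/html;charset=utf-8,' +
        encodeURIComponent('<h1>Replaced response</h1>')
    };
  },
  { urls: ['https://example.com/some/page*'] },  // placeholder URL pattern
  ['blocking']
);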
I just released a Devtools extension that does just that :)
It's called tamper; it's based on mitmproxy and allows you to see all requests made by the current tab, modify them, and serve the modified version the next time you refresh.
It's a pretty early version but it should be compatible with OS X and Windows. Let me know if it doesn't work for you.
You can get it here http://dutzi.github.io/tamper/
How this works
As @Xan commented below, the extension communicates through Native Messaging with a Python script that extends mitmproxy.
The extension lists all requests using chrome.devtools.network.onRequestFinished.
When you click one of the requests, it downloads its response using the request object's getContent() method, and then sends that response to the Python script, which saves it locally.
It then opens the file in an editor (using call on OS X or subprocess.Popen on Windows).
The Python script uses mitmproxy to listen to all communication made through that proxy; if it detects a request for a file that was saved, it serves the saved file instead.
I used Chrome's proxy API (specifically chrome.proxy.settings.set()) to set a PAC file as the proxy setting. That PAC file redirects all communication to the Python script's proxy.
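For illustration, setting a PAC script through that API looks roughly like this (the proxy port here is a placeholder, not necessarily what tamper uses):

// Rough sketch: point Chrome at a PAC script that routes everything
// through a local proxy such as mitmproxy.
const pacScript = `function FindProxyForURL(url, host) {
  return "PROXY 127.0.0.1:8080";
}`;
chrome.proxy.settings.set({
  value: { mode: "pac_script", pacScript: { data: pacScript } },
  scope: "regular"
});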
One of the greatest things about mitmproxy is that it can also modify HTTPS communication, so you have that as well :)
Like @Rob W said, I've overridden XMLHttpRequest; the result below can modify any XHR request on any site (working like a transparent modification proxy):
var _open = XMLHttpRequest.prototype.open;
window.XMLHttpRequest.prototype.open = function (method, URL) {
  var _onreadystatechange = this.onreadystatechange,
      _this = this;
  _this.onreadystatechange = function () {
    // catch only completed 'api/search/universal' requests
    if (_this.readyState === 4 && _this.status === 200 && ~URL.indexOf('api/search/universal')) {
      try {
        //////////////////////////////////////
        // THIS IS ACTIONS FOR YOUR REQUEST //
        // EXAMPLE:                         //
        //////////////////////////////////////
        var data = JSON.parse(_this.responseText); // {"fields": ["a","b"]}
        if (data.fields) {
          data.fields.push('c','d');
        }
        // rewrite responseText
        Object.defineProperty(_this, 'responseText', {value: JSON.stringify(data)});
        /////////////// END //////////////////
      } catch (e) {}
      console.log('Caught! :)', method, URL/*, _this.responseText*/);
    }
    // call original callback
    if (_onreadystatechange) _onreadystatechange.apply(this, arguments);
  };
  // detect any onreadystatechange changing
  Object.defineProperty(this, "onreadystatechange", {
    get: function () {
      return _onreadystatechange;
    },
    set: function (value) {
      _onreadystatechange = value;
    }
  });
  return _open.apply(_this, arguments);
};
For example, this code can be used successfully in Tampermonkey to make such modifications on any site :)
Yes. It is possible with the chrome.debugger API, which grants extension access to the Chrome DevTools Protocol, which supports HTTP interception and modification through its Network API.
This solution was suggested by a comment on Chrome Issue 487422:
For anyone wanting an alternative which is doable at the moment, you can use chrome.debugger in a background/event page to attach to the specific tab you want to listen to (or attach to all tabs if that's possible, haven't tested all tabs personally), then use the network API of the debugging protocol.
The only problem with this is that there will be the usual yellow bar at the top of the tab's viewport, unless the user turns it off in chrome://flags.
First, attach a debugger to the target:
chrome.debugger.getTargets((targets) => {
  let target = /* Find the target. */;
  let debuggee = { targetId: target.id };
  chrome.debugger.attach(debuggee, "1.2", () => {
    // TODO
  });
});
Next, send the Network.setRequestInterceptionEnabled command, which will enable interception of network requests:
chrome.debugger.getTargets((targets) => {
  let target = /* Find the target. */;
  let debuggee = { targetId: target.id };
  chrome.debugger.attach(debuggee, "1.2", () => {
    chrome.debugger.sendCommand(debuggee, "Network.setRequestInterceptionEnabled", { enabled: true });
  });
});
Chrome will now begin sending Network.requestIntercepted events. Add a listener for them:
chrome.debugger.getTargets((targets) => {
  let target = /* Find the target. */;
  let debuggee = { targetId: target.id };
  chrome.debugger.attach(debuggee, "1.2", () => {
    chrome.debugger.sendCommand(debuggee, "Network.setRequestInterceptionEnabled", { enabled: true });
  });
  chrome.debugger.onEvent.addListener((source, method, params) => {
    if(source.targetId === target.id && method === "Network.requestIntercepted") {
      // TODO
    }
  });
});
In the listener, params.request will be the corresponding Request object.
Send the response with Network.continueInterceptedRequest:
Pass a base64 encoding of your desired HTTP raw response (including HTTP status line, headers, etc!) as rawResponse.
Pass params.interceptionId as interceptionId.
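Roughly, inside that listener it might look something like this (the raw response string is just an illustrative placeholder):

// Build a raw HTTP response (status line + headers + body) and return it
// for the intercepted request. btoa is available in the background page.
const rawResponse =
  "HTTP/1.1 200 OK\r\n" +
  "Content-Type: text/html; charset=utf-8\r\n" +
  "\r\n" +
  "<html><body>Modified response</body></html>";
chrome.debugger.sendCommand(source, "Network.continueInterceptedRequest", {
  interceptionId: params.interceptionId,
  rawResponse: btoa(rawResponse)
});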
Note that I have not tested any of this, at all.
While Safari has this feature built-in, the best workaround I've found for Chrome so far is to use Cypress's intercept functionality. It cleanly allows me to stub HTTP responses in Chrome. I call cy.intercept then cy.visit(<URL>) and it intercepts and provides a stubbed response for a specific request the visited page makes. Here's an example:
cy.intercept('GET', '/myapiendpoint', {
  statusCode: 200,
  body: {
    myexamplefield: 'Example value',
  },
})
cy.visit('http://localhost:8080/mytestpage')
Note: You may also need to configure Cypress to disable some Chrome-specific security settings.
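For example, one of those settings is chromeWebSecurity, which can be turned off in the Cypress config; whether you actually need this depends on the page under test:

// cypress.config.js (Cypress 10+); in older versions the same flag lives in cypress.json.
const { defineConfig } = require('cypress');

module.exports = defineConfig({
  chromeWebSecurity: false,
});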
The original question was about Chrome extensions, but I notice that it has branched out into different methods, going by the upvotes on answers that have non-Chrome-extension methods.
Here's a way to kind of achieve this with Puppeteer. Note the caveat mentioned on the originalContent line - the fetched response may be different to the original response in some circumstances.
With Node.js:
npm install puppeteer node-fetch@2.6.7
Create this main.js:
const puppeteer = require("puppeteer");
const fetch = require("node-fetch");

(async function() {
  const browser = await puppeteer.launch({headless:false});
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', async (request) => {
    let url = request.url().replace(/\/$/g, ""); // remove trailing slash from urls
    console.log("REQUEST:", url);
    let originalContent = await fetch(url).then(r => r.text()); // TODO: Pass request headers here for more accurate response (still not perfect, but more likely to be the same as the "actual" response)
    if(url === "https://example.com") {
      request.respond({
        status: 200,
        contentType: 'text/html; charset=utf-8', // For JS files: 'application/javascript; charset=utf-8'
        body: originalContent.replace(/example/gi, "TESTING123"),
      });
    } else {
      request.continue();
    }
  });
  await page.goto("https://example.com");
})();
Run it:
node main.js
With Deno:
Install Deno:
curl -fsSL https://deno.land/install.sh | sh # linux, mac
irm https://deno.land/install.ps1 | iex # windows powershell
Download Chrome for Puppeteer:
PUPPETEER_PRODUCT=chrome deno run -A --unstable https://deno.land/x/puppeteer@16.2.0/install.ts
Create this main.js:
import puppeteer from "https://deno.land/x/puppeteer@16.2.0/mod.ts";

const browser = await puppeteer.launch({headless:false});
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', async (request) => {
  let url = request.url().replace(/\/$/g, ""); // remove trailing slash from urls
  console.log("REQUEST:", url);
  let originalContent = await fetch(url).then(r => r.text()); // TODO: Pass request headers here for more accurate response (still not perfect, but more likely to be the same as the "actual" response)
  if(url === "https://example.com") {
    request.respond({
      status: 200,
      contentType: 'text/html; charset=utf-8', // For JS files: 'application/javascript; charset=utf-8'
      body: originalContent.replace(/example/gi, "TESTING123"),
    });
  } else {
    request.continue();
  }
});
await page.goto("https://example.com");
Run it:
deno run -A --unstable main.js
(I'm currently running into a TimeoutError with this that will hopefully be resolved soon: https://github.com/lucacasonato/deno-puppeteer/issues/65)
Yes, you can modify HTTP responses in a Chrome extension. I built ModResponse (https://modheader.com/modresponse) to do exactly that. It can record and replay your HTTP responses, modify them, add delay, and even use the HTTP response from a different server (like your localhost).
The way it works is to use the chrome.debugger API (https://developer.chrome.com/docs/extensions/reference/debugger/), which gives you access to Chrome DevTools Protocol (https://chromedevtools.github.io/devtools-protocol/). You can then intercept the request and response using the Fetch Domain API (https://chromedevtools.github.io/devtools-protocol/tot/Fetch/), then override the response you want. (You can also use the Network Domain, though it is deprecated in favor of the Fetch Domain)
The nice thing about this approach is that it will just work out of the box. No desktop app installation required. No extra proxy setup. However, it will show a debugging banner in Chrome (which you can hide by adding an argument to Chrome), and it is significantly more complicated to set up than other APIs.
For examples on how to use the debugger API, take a look at the chrome-extensions-samples: https://github.com/GoogleChrome/chrome-extensions-samples/tree/main/mv2-archive/api/debugger/live-headers
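As a rough sketch of that Fetch-domain flow (assuming a tab has already been attached with chrome.debugger.attach, as in the earlier answer, and debuggee refers to it):

// Enable fetch interception for the attached tab.
chrome.debugger.sendCommand(debuggee, "Fetch.enable", { patterns: [{ urlPattern: "*" }] });

chrome.debugger.onEvent.addListener((source, method, params) => {
  if (method === "Fetch.requestPaused") {
    // Fulfill the paused request with our own body (base64-encoded).
    chrome.debugger.sendCommand(source, "Fetch.fulfillRequest", {
      requestId: params.requestId,
      responseCode: 200,
      responseHeaders: [{ name: "Content-Type", value: "application/json" }],
      body: btoa(JSON.stringify({ mocked: true }))
    });
  }
});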
I've just found this extension; it does a lot of other things, but modifying API responses in the browser works really well: https://requestly.io/
Follow these steps to get it working:
Install the extension
Go to HttpRules
Add a new rule and add a url and a response
Enable the rule with the radio button
Go to Chrome and you should see the response is modified
You can have multiple rules with different responses and enable/disable them as required. Unfortunately, I haven't found out how to return a different response per request when the URL is the same.
