Capture requests (XHR, JS, CSS) from embedded iframes using devtool protocol - node.js

For context, I am developing a synthetic monitoring tool using Node.js and puppeteer.
For each step of a defined scenario, I capture a screenshot, a waterfall and performance metrics.
My problem is with the waterfall: I previously used puppeteer-har, but this package is not able to capture requests outside of a navigation.
Therefore I use this piece of code to capture all interesting requests:
const {harFromMessages} = require('chrome-har');
// Event types to observe for waterfall saving (probably overkill, I just set all events of Page and Network)
const observe = [
'Page.domContentEventFired',
'Page.fileChooserOpened',
'Page.frameAttached',
'Page.frameDetached',
'Page.frameNavigated',
'Page.interstitialHidden',
'Page.interstitialShown',
'Page.javascriptDialogClosed',
'Page.javascriptDialogOpening',
'Page.lifecycleEvent',
'Page.loadEventFired',
'Page.windowOpen',
'Page.frameClearedScheduledNavigation',
'Page.frameScheduledNavigation',
'Page.compilationCacheProduced',
'Page.downloadProgress',
'Page.downloadWillBegin',
'Page.frameRequestedNavigation',
'Page.frameResized',
'Page.frameStartedLoading',
'Page.frameStoppedLoading',
'Page.navigatedWithinDocument',
'Page.screencastFrame',
'Page.screencastVisibilityChanged',
'Network.dataReceived',
'Network.eventSourceMessageReceived',
'Network.loadingFailed',
'Network.loadingFinished',
'Network.requestServedFromCache',
'Network.requestWillBeSent',
'Network.responseReceived',
'Network.webSocketClosed',
'Network.webSocketCreated',
'Network.webSocketFrameError',
'Network.webSocketFrameReceived',
'Network.webSocketFrameSent',
'Network.webSocketHandshakeResponseReceived',
'Network.webSocketWillSendHandshakeRequest',
'Network.requestWillBeSentExtraInfo',
'Network.resourceChangedPriority',
'Network.responseReceivedExtraInfo',
'Network.signedExchangeReceived',
'Network.requestIntercepted'
];
At the start of the step :
// list of events for converting to HAR
const events = [];
client = await page.target().createCDPSession();
await client.send('Page.enable');
await client.send('Network.enable');
observe.forEach(method => {
client.on(method, params => {
events.push({ method, params });
});
});
At the end of the step :
waterfall = await harFromMessages(events);
It works well for navigation events, and also for navigation inside a web application.
However, the web application I am trying to monitor has iframes holding the main content.
I would like to see the iframes' requests in my waterfall.
So, a few questions:
Why doesn't Network.responseReceived, or any other event, capture these requests?
Is it possible to capture such requests?
So far I've read the DevTools Protocol documentation and found nothing I could use.
The closest thing to my problem that I found is this question:
How can I receive events for an embedded iframe using Chrome Devtools Protocol?
My guess is that I have to enable the Network domain for each iframe I may encounter.
I didn't find any way to do this. If there is a way to do it with the DevTools Protocol, I should have no problem implementing it with Node.js and puppeteer.
Thanks for your insights!
EDIT 18/08 :
After more searching on the subject, mostly about out-of-process iframes, I found that lots of people on the internet point to this response:
https://bugs.chromium.org/p/chromium/issues/detail?id=924937#c13
The answer in question states:
Note that the easiest workaround is the --disable-features flag.
That said, to work with out-of-process iframes over DevTools protocol,
you need to use Target [1] domain:
Call Target.setAutoAttach with flatten=true;
You'll receive Target.attachedToTarget event with a sessionId for the iframe;
Treat that session as a separate "page" in chrome-remote-interface. Send separate protocol messages with additional sessionId field:
{id: 3, sessionId: "", method: "Runtime.enable", params: {}}
You'll get responses and events with the same "sessionId" field which means they are coming from that frame. For example:
{sessionId: "", method: "Runtime.consoleAPICalled", params: {...}}
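To make the last two points of the quoted answer concrete, here is a minimal sketch of how flattened-protocol messages could be split per session once they arrive. The helper name `groupBySession`, the `'root'` bucket and the sample session id `'IFRAME-1'` are hypothetical and only serve the illustration; the sketch assumes that messages without a `sessionId` belong to the root page session.

```javascript
// Hypothetical helper: bucket raw DevTools protocol messages by session.
// Assumption: messages without a sessionId come from the root session.
function groupBySession(messages) {
  const bySession = new Map();
  for (const message of messages) {
    const key = message.sessionId || 'root';
    if (!bySession.has(key)) bySession.set(key, []);
    bySession.get(key).push({ method: message.method, params: message.params });
  }
  return bySession;
}

// Made-up sample messages: one from the page, two from an iframe session.
const sample = [
  { method: 'Network.requestWillBeSent', params: { requestId: '1' } },
  { sessionId: 'IFRAME-1', method: 'Network.requestWillBeSent', params: { requestId: '2' } },
  { sessionId: 'IFRAME-1', method: 'Network.responseReceived', params: { requestId: '2' } },
];
const grouped = groupBySession(sample);
console.log(grouped.get('root').length, grouped.get('IFRAME-1').length);
```

Each bucket has the same `{ method, params }` shape as the `events` array used earlier, so it could be inspected or converted separately.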
However I'm still not able to implement it.
I'm trying this, mostly based on puppeteer :
const events = [];
const targets = await browser.targets();
const nbTargets = targets.length;
for(var i=0;i<nbTargets;i++){
console.log(targets[i].type());
if (targets[i].type() === 'page') {
client = await targets[i].createCDPSession();
await client.send("Target.setAutoAttach", {
autoAttach: true,
flatten: true,
windowOpen: true,
waitForDebuggerOnStart: false // is set to false in pptr
})
await client.send('Page.enable');
await client.send('Network.enable');
observeTest.forEach(method => {
client.on(method, params => {
events.push({ method, params });
});
});
}
};
But I still don't get the expected output for navigation in a web application inside an iframe.
However, I am able to capture all the requests during the step where the iframe is loaded.
What I am missing are the requests that happen outside of a proper navigation.
Does anyone have an idea how to integrate the Chromium response above into puppeteer? Thanks!

I was looking in the wrong place all this time.
The Chrome network events are captured correctly, as I would have seen had I inspected the "events" variable earlier.
The problem comes from the "chrome-har" package, which I use here:
waterfall = await harFromMessages(events);
The package expects the page and iframe main events to be present in the same batch of events as the requests. Otherwise the request "can't be mapped to any page at the moment".
Since some steps of my scenario are a navigation within the same web application (i.e. no navigation event), I didn't have these events, so chrome-har couldn't map the requests and therefore produced an empty .har.
Hope this helps someone else; I messed up the debugging on this one...
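A quick way to spot this situation before converting is to summarise the captured batch. The sketch below is not part of chrome-har; `summarizeBatch` and `PAGE_START_EVENTS` are hypothetical names, and the list of events that open a page entry is an assumption that should be checked against the converter's source.

```javascript
// Assumption: these are the events that let a HAR converter open a "page"
// entry; check the converter's source for the list it actually uses.
const PAGE_START_EVENTS = ['Page.frameStartedLoading', 'Page.navigatedWithinDocument'];

// Hypothetical helper: summarise a batch of { method, params } events.
function summarizeBatch(events, pageStartEvents = PAGE_START_EVENTS) {
  const counts = {};
  for (const { method } of events) {
    counts[method] = (counts[method] || 0) + 1;
  }
  const hasPageEvent = pageStartEvents.some(name => counts[name] > 0);
  const requestCount = counts['Network.requestWillBeSent'] || 0;
  return { counts, hasPageEvent, requestCount };
}

// Made-up batch resembling a step without any navigation.
const batch = [
  { method: 'Network.requestWillBeSent', params: {} },
  { method: 'Network.responseReceived', params: {} },
];
const summary = summarizeBatch(batch);
if (summary.requestCount > 0 && !summary.hasPageEvent) {
  console.warn('Requests captured but no page event in this batch; the HAR may come out empty.');
}
```

If the warning fires, one option is to keep the page events from an earlier step and prepend them to the batch before calling `harFromMessages`.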

Related

Web Scraping NodeJs - How to recover resources when the page loads in full after several requests

I'm trying to retrieve each item (composed of an image, a word and its translation) from this page.
Link to the website: https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod%5Btoggle%5D%5BhasImage%5D=true
I used jsdom and got.
Here is the code
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const got = require('got');
(async () => {
const response = await got("https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod%5Btoggle%5D%5BhasImage%5D=true");
console.log(response.body);
const dom = new JSDOM(response.body);
console.log(dom.window.document.querySelectorAll(".ld-egdn1r"))
})();
When I display the HTML that is returned, it does not correspond to what I see when I open the site in my browser. There are no HTML tags that contain the items.
When I look at the Network tab, other resources are loaded, but again I can't find the request that retrieves the words.
I think what I am looking for is loaded across several requests, but I don't know which one.
Here are the steps:
[screenshot not preserved]
Then you will get code like this:
fetch("https://xcvbaysyxd-dsn.algolia.net/1/indexes/*/queries", {
"credentials": "omit",
"headers": {},
"referrer": "https://livingdictionaries.app/",
"body": "...",
"method": "POST",
"mode": "cors"
});
You will just have to process the data manually after that. In your case:
const fetch = require("node-fetch") // npm i node-fetch@2 (v3 is ESM-only and cannot be loaded with require)
const data = await fetch(...).then(r => r.json())
const product = data.results.map(r => r.hits)
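Note that `data.results.map(r => r.hits)` yields one array per query. If a single flat list is more convenient, something like the following sketch could be used. The helper name `collectHits` and the sample objects are made up; only the `results` and `hits` keys come from the snippet above.

```javascript
// Hypothetical helper: collect every hit from a multi-query response
// shaped like { results: [{ hits: [...] }, ...] }.
function collectHits(data) {
  return (data.results || []).flatMap(result => result.hits || []);
}

// Made-up sample data, only to show the shape.
const sampleResponse = {
  results: [
    { hits: [{ id: 'a' }, { id: 'b' }] },
    { hits: [{ id: 'c' }] },
  ],
};
const hits = collectHits(sampleResponse);
console.log(hits.length);
```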
The site you are trying to scrape is a Single Page Application (SPA) built with Svelte and the individual elements are dynamically rendered as needed, as many websites are today. Since the HTML is not hard-coded, these sites are notoriously difficult to scrape.
If you just log the response, you will see that the elements for which you are selecting do not exist. This is because it is the browser that interprets the JavaScript at run time and updates the UI. A GET request using got, axios, fetch, whatever, cannot perform such tasks.
You will need to implement the use of a headless browser like Puppeteer in order to dynamically render the site and scrape.

Chain of endpoints in Node and Express: how to prevent that some of them stops all the series?

On one page I have to get information from 8 different endpoints. 2 of them are outside of my application, and sometimes they cause a delay in displaying data. The web browser waits until the data is processed. Since they're outside of my app, I can't refactor them to make them faster, but I need to show the information they provide. In addition, sometimes one of them returns nothing. If so, I show default data to the user. The waiting time hurts the user experience.
I'm using promises to call these endpoints. Below is part of the code that I am using.
The code is working fine. The issue is the delay.
First, here is the array that contains all the services that I need to process:
var requests = [{
// 0
url: urlLocalApi + '/endpointURL_1/',
headers: {
'headers': 'apitoken'
},
}, {
// 1
url: urlLocalApi + '/endpointURL_2/',
headers: {
'headers': 'apitoken'
}
}];
The array is built by this method:
const requests = homePageFunctions.createRequest();
Now, here is how the data is processed. I am using both 'request-promise' and 'bluebird', plus my own logger to check that everything goes fine.
const Promise = require("bluebird");
const request = require('request-promise');
var viewsHelper = {
getPageData: function (requests) {
return Promise.map(requests, function (obj) {
return request(obj).then(function (body) {
AppLogger.log(`Endpoint parsed`, statusLogger.infodate);
return JSON.parse(body);
});
});
}
}
module.exports = viewsHelper;
How do I call this?
viewsHelper.getPageData(requests)
.then(results => {
var output = [];
for (var i = 0; i < results.length; i++) {
output.push(results[i]);
}
// render data
res.render('homepage/index', output);
AppLogger.log(`PageData is rendered`, statusLogger.infodate);
})
.catch(err => {
console.log(err);
});
};
Note that each item of the "output" array holds the data returned by one endpoint.
The problem here is:
If any of the endpoints takes a long time, the entire chain slows down, even though
the others have already been processed. The web page stays blank while waiting.
How can I prevent this behavior?
That is an interesting question, but I have a few questions of my own in order to answer it effectively.
You have a Node server and a client (HTML/JS).
You have 8 endpoints; 2 are slow because you don't have control over them.
1. Is the client (page) aware of the 8 endpoints? I.e. does it make 8 calls every time you reload the page?
OR
2. Does the page make one request to your Node.js server, which then calls the 8 endpoints and waits for all of them?
If it is 1, then lazy loading will work easily for you, since the page is making the requests.
If it is 2, lazy loading will work only on the server side; the client will still be blocked because it doesn't know (or care) how you load your data. The page made one request and is blocked waiting for that request.
Obviously each method has pros and cons.
One way you can solve this is to call those endpoints asynchronously on Node and cache the results, so that when the page makes its one request you have the data ready.
Again, we know very little about the situation; there are many ways to solve this.
Hope this helps
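Another option, not taken from the answer above, is to cap how long the server waits for each endpoint and fall back to the default data the question already mentions. The sketch below uses native promises rather than bluebird; `withTimeout`, the `fast` and `slow` stand-ins and the 100 ms budget are all hypothetical.

```javascript
// Resolve with `fallback` if `promise` rejects or takes longer than `ms`.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise.catch(() => fallback), timeout])
    .finally(() => clearTimeout(timer));
}

// Hypothetical stand-ins for a fast local endpoint and a slow external one.
const fast = () => new Promise(resolve => setTimeout(() => resolve({ source: 'fast' }), 10));
const slow = () => new Promise(resolve => setTimeout(() => resolve({ source: 'slow' }), 500));

// Each call gets its own 100 ms budget, so the slow one cannot hold up the page.
const pending = Promise.all([
  withTimeout(fast(), 100, { source: 'default' }),
  withTimeout(slow(), 100, { source: 'default' }),
]);
pending.then(results => console.log(results.map(r => r.source)));
```

In `getPageData`, the same wrapper could be placed around each `request(obj)` call, so that the render step always receives a value for every endpoint within the chosen budget.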

how to refresh page when hitting api endpoint (nuxt.js)

I'm making an app that displays a list of orders. The problem is when I submit a new order by hitting an endpoint with a post request with the data for a new order, the page or the components need to refresh automatically to display this new order. I don't know how this is achieved with nuxt. The client-side HTML rendering needs to be actively reacting to events happening on the server-side.
So, you don't actually need to refresh the page to do this, you can use axios to POST an order and update your page with that particular order
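export default {
  data() {
    return {
      newOrder: null
    }
  },
  methods: {
    // Named differently from the `newOrder` data property to avoid a clash.
    async submitOrder() {
      const response = await this.$axios.post(...)
      // axios exposes the response payload on `data`
      this.newOrder = response.data
    },
  }
}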

Scrolling in browser cause slow fetch(url) responses

[context] - this turned out to be irrelevant, the issue at hand is a client side thing
I'm experiencing some strange response times from my NodeJS/Express app.
According to the logs, the requests complete in 180-220 ms.
But from the web client's perspective, I'm seeing very strange numbers; some are in the range of seconds, and the payload is not that big, around 1.5k.
I've disabled every feature I could suspect, Sessions, AD Authentication etc.
To add to the confusion, the overhead is only there sometimes, maybe 30% of the requests. others complete in the same time range as listed in the first picture.
I wish I could give more context but it's a really simple React frontend using Fetch to HTTP GET data from the /contacts endpoint.
Nothing more.
[EDIT 1]
This seems to be a client side thing.
The whole process is part of a virtual/infinite scroll React component.
If I do scroll down using the down arrow key, response times stay normal.
If I scroll really fast using the scrollbar/touch, the response time goes up.
Important to know is that the services do not return more data, it is always 20 rows at a time, yet the response time increases.
So, does scrolling in a browser somehow prevent promises or other more native constructs from completing?
[EDIT 2]
Same behavior across Chrome, Safari, Firefox.
Scrolling fast seems to make the request unable to complete in a timely manner.
Upgrading to React 16 seems to have improved the issue, or maybe just false positives.
[EDIT 3]
Even when replacing the call to the backend with a call to a static JSON service online, the behavior persists, so this has nothing to do with Express or Node.js.
There is something odd about scrolling and React+Fetch.
[EDIT 4]
Client side code snippets:
Event listeners
this.scrollHandler = this.checkWindowScroll;
this.resizeHandler = this.checkWindowScroll;
window.addEventListener("scroll", this.scrollHandler, { passive: true });
window.addEventListener("resize", this.resizeHandler, { passive: true });
ScrollHandler
//check if we have scrolled to the bottom of the screen
checkWindowScroll = () => {
if (this.state.loading) {
return;
}
const trigger = 800;
const pageBottom =
window.document.body.getBoundingClientRect().height - trigger;
const scrollBottom = window.pageYOffset + window.innerHeight;
if (scrollBottom > pageBottom) {
this.loadMore();
}
};
Fetch Next Data
//responsible for fetching another chunk of data from a backend
async loadMore() {
this.setState({ loading: true, error: undefined }); // begin load
const items = await this.props.fetchData(
this.search,
this.state.items.length
);
this.setState(
{
loading: false,
error: undefined,
items: [...this.state.items, ...items] // clone
},
() => this.checkWindowScroll() // once state is updated, check again if we need more data
);
}

Modify HTTP responses from a Chrome extension

Is it possible to create a Chrome extension that modifies HTTP response bodies?
I have looked in the Chrome Extension APIs, but I haven't found anything to do this.
In general, you cannot change the response body of an HTTP request using the standard Chrome extension APIs.
This feature is being requested at 104058: WebRequest API: allow extension to edit response body. Star the issue to get notified of updates.
If you want to edit the response body for a known XMLHttpRequest, inject code via a content script to override the default XMLHttpRequest constructor with a custom (full-featured) one that rewrites the response before triggering the real event. Make sure that your XMLHttpRequest object is fully compliant with Chrome's built-in XMLHttpRequest object, or AJAX-heavy sites will break.
In other cases, you can use the chrome.webRequest or chrome.declarativeWebRequest APIs to redirect the request to a data:-URI. Unlike the XHR-approach, you won't get the original contents of the request. Actually, the request will never hit the server because redirection can only be done before the actual request is sent. And if you redirect a main_frame request, the user will see the data:-URI instead of the requested URL.
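As a rough sketch of the data:-URI approach, the replacement body first has to be encoded into such a URI. The helper name `toDataUri` is hypothetical, and the snippet is written for Node so that it can be run on its own; inside an extension, `btoa` would play the role of `Buffer`, and the resulting string would be returned as `redirectUrl` from a blocking `chrome.webRequest.onBeforeRequest` listener.

```javascript
// Hypothetical helper: encode a replacement body as a base64 data: URI.
function toDataUri(mimeType, body) {
  const base64 = Buffer.from(body, 'utf8').toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Made-up replacement payload.
const stubUri = toDataUri('application/json', '{"ok":true}');
console.log(stubUri);
```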
I just released a Devtools extension that does just that :)
It's called tamper, it's based on mitmproxy and it allows you to see all requests made by the current tab, modify them and serve the modified version next time you refresh.
It's a pretty early version but it should be compatible with OS X and Windows. Let me know if it doesn't work for you.
You can get it here http://dutzi.github.io/tamper/
How this works
As @Xan commented below, the extension communicates through Native Messaging with a Python script that extends mitmproxy.
The extension lists all requests using chrome.devtools.network.onRequestFinished.
When you click on one of the requests, it downloads its response using the request object's getContent() method, and then sends that response to the Python script, which saves it locally.
It then opens the file in an editor (using call for OS X or subprocess.Popen for Windows).
The Python script uses mitmproxy to listen to all communication made through that proxy; if it detects a request for a file that was saved, it serves the saved file instead.
I used Chrome's proxy API (specifically chrome.proxy.settings.set()) to set a PAC as the proxy setting. That PAC file redirects all communication to the Python script's proxy.
One of the greatest things about mitmproxy is that it can also modify HTTPS communication. So you have that also :)
Like @Rob W said, I've overridden XMLHttpRequest, and this is the result for modifying any XHR request on any site (it works like a transparent modification proxy):
var _open = XMLHttpRequest.prototype.open;
window.XMLHttpRequest.prototype.open = function (method, URL) {
var _onreadystatechange = this.onreadystatechange,
_this = this;
_this.onreadystatechange = function () {
// catch only completed 'api/search/universal' requests
if (_this.readyState === 4 && _this.status === 200 && ~URL.indexOf('api/search/universal')) {
try {
//////////////////////////////////////
// THIS IS ACTIONS FOR YOUR REQUEST //
// EXAMPLE: //
//////////////////////////////////////
var data = JSON.parse(_this.responseText); // {"fields": ["a","b"]}
if (data.fields) {
data.fields.push('c','d');
}
// rewrite responseText
Object.defineProperty(_this, 'responseText', {value: JSON.stringify(data)});
/////////////// END //////////////////
} catch (e) {}
console.log('Caught! :)', method, URL/*, _this.responseText*/);
}
// call original callback
if (_onreadystatechange) _onreadystatechange.apply(this, arguments);
};
// detect any onreadystatechange changing
Object.defineProperty(this, "onreadystatechange", {
get: function () {
return _onreadystatechange;
},
set: function (value) {
_onreadystatechange = value;
}
});
return _open.apply(_this, arguments);
};
for example this code can be used successfully by Tampermonkey for making any modifications on any sites :)
Yes. It is possible with the chrome.debugger API, which grants extension access to the Chrome DevTools Protocol, which supports HTTP interception and modification through its Network API.
This solution was suggested by a comment on Chrome Issue 487422:
For anyone wanting an alternative which is doable at the moment, you can use chrome.debugger in a background/event page to attach to the specific tab you want to listen to (or attach to all tabs if that's possible, haven't tested all tabs personally), then use the network API of the debugging protocol.
The only problem with this is that there will be the usual yellow bar at the top of the tab's viewport, unless the user turns it off in chrome://flags.
First, attach a debugger to the target:
chrome.debugger.getTargets((targets) => {
let target = /* Find the target. */;
let debuggee = { targetId: target.id };
chrome.debugger.attach(debuggee, "1.2", () => {
// TODO
});
});
Next, send the Network.setRequestInterceptionEnabled command, which will enable interception of network requests:
chrome.debugger.getTargets((targets) => {
let target = /* Find the target. */;
let debuggee = { targetId: target.id };
chrome.debugger.attach(debuggee, "1.2", () => {
chrome.debugger.sendCommand(debuggee, "Network.setRequestInterceptionEnabled", { enabled: true });
});
});
Chrome will now begin sending Network.requestIntercepted events. Add a listener for them:
chrome.debugger.getTargets((targets) => {
let target = /* Find the target. */;
let debuggee = { targetId: target.id };
chrome.debugger.attach(debuggee, "1.2", () => {
chrome.debugger.sendCommand(debuggee, "Network.setRequestInterceptionEnabled", { enabled: true });
});
chrome.debugger.onEvent.addListener((source, method, params) => {
if(source.targetId === target.id && method === "Network.requestIntercepted") {
// TODO
}
});
});
In the listener, params.request will be the corresponding Request object.
Send the response with Network.continueInterceptedRequest:
Pass a base64 encoding of your desired HTTP raw response (including HTTP status line, headers, etc!) as rawResponse.
Pass params.interceptionId as interceptionId.
Note that I have not tested any of this, at all.
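For the `rawResponse` value, a raw HTTP response has to be assembled and base64-encoded. The following is a minimal sketch of that step only; the helper name `buildRawResponse` is hypothetical, and it uses Node's `Buffer` so that it can be run standalone (in an extension, `btoa` or an equivalent would be needed instead).

```javascript
// Hypothetical helper: build a raw HTTP/1.1 response (status line, headers,
// blank line, body) and return it base64-encoded.
function buildRawResponse(statusCode, statusText, headers, body) {
  const bodyBytes = Buffer.from(body, 'utf8');
  // Content-Length is computed from the encoded body, in bytes.
  const headerLines = Object.entries({ ...headers, 'Content-Length': bodyBytes.length })
    .map(([name, value]) => `${name}: ${value}`);
  const head = [`HTTP/1.1 ${statusCode} ${statusText}`, ...headerLines].join('\r\n');
  return Buffer.concat([Buffer.from(head + '\r\n\r\n', 'utf8'), bodyBytes]).toString('base64');
}

const rawResponse = buildRawResponse(200, 'OK', { 'Content-Type': 'text/plain' }, 'hello');
// Decode again to check what would be sent.
console.log(Buffer.from(rawResponse, 'base64').toString('utf8'));
```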
While Safari has this feature built-in, the best workaround I've found for Chrome so far is to use Cypress's intercept functionality. It cleanly allows me to stub HTTP responses in Chrome. I call cy.intercept then cy.visit(<URL>) and it intercepts and provides a stubbed response for a specific request the visited page makes. Here's an example:
cy.intercept('GET', '/myapiendpoint', {
statusCode: 200,
body: {
myexamplefield: 'Example value',
},
})
cy.visit('http://localhost:8080/mytestpage')
Note: You may also need to configure Cypress to disable some Chrome-specific security settings.
The original question was about Chrome extensions, but I notice that it has branched out into different methods, going by the upvotes on answers that have non-Chrome-extension methods.
Here's a way to kind of achieve this with Puppeteer. Note the caveat mentioned on the originalContent line - the fetched response may be different to the original response in some circumstances.
With Node.js:
npm install puppeteer node-fetch@2.6.7
Create this main.js:
const puppeteer = require("puppeteer");
const fetch = require("node-fetch");
(async function() {
const browser = await puppeteer.launch({headless:false});
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', async (request) => {
let url = request.url().replace(/\/$/g, ""); // remove trailing slash from urls
console.log("REQUEST:", url);
let originalContent = await fetch(url).then(r => r.text()); // TODO: Pass request headers here for more accurate response (still not perfect, but more likely to be the same as the "actual" response)
if(url === "https://example.com") {
request.respond({
status: 200,
contentType: 'text/html; charset=utf-8', // For JS files: 'application/javascript; charset=utf-8'
body: originalContent.replace(/example/gi, "TESTING123"),
});
} else {
request.continue();
}
});
await page.goto("https://example.com");
})();
Run it:
node main.js
With Deno:
Install Deno:
curl -fsSL https://deno.land/install.sh | sh # linux, mac
irm https://deno.land/install.ps1 | iex # windows powershell
Download Chrome for Puppeteer:
PUPPETEER_PRODUCT=chrome deno run -A --unstable https://deno.land/x/puppeteer@16.2.0/install.ts
Create this main.js:
import puppeteer from "https://deno.land/x/puppeteer@16.2.0/mod.ts";
const browser = await puppeteer.launch({headless:false});
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', async (request) => {
let url = request.url().replace(/\/$/g, ""); // remove trailing slash from urls
console.log("REQUEST:", url);
let originalContent = await fetch(url).then(r => r.text()); // TODO: Pass request headers here for more accurate response (still not perfect, but more likely to be the same as the "actual" response)
if(url === "https://example.com") {
request.respond({
status: 200,
contentType: 'text/html; charset=utf-8', // For JS files: 'application/javascript; charset=utf-8'
body: originalContent.replace(/example/gi, "TESTING123"),
});
} else {
request.continue();
}
});
await page.goto("https://example.com");
Run it:
deno run -A --unstable main.js
(I'm currently running into a TimeoutError with this that will hopefully be resolved soon: https://github.com/lucacasonato/deno-puppeteer/issues/65)
Yes, you can modify HTTP response in a Chrome extension. I built ModResponse (https://modheader.com/modresponse) that does that. It can record and replay your HTTP response, modify it, add delay, and even use the HTTP response from a different server (like from your localhost)
The way it works is to use the chrome.debugger API (https://developer.chrome.com/docs/extensions/reference/debugger/), which gives you access to Chrome DevTools Protocol (https://chromedevtools.github.io/devtools-protocol/). You can then intercept the request and response using the Fetch Domain API (https://chromedevtools.github.io/devtools-protocol/tot/Fetch/), then override the response you want. (You can also use the Network Domain, though it is deprecated in favor of the Fetch Domain)
The nice thing about this approach is that it will just work out of box. No desktop app installation required. No extra proxy setup. However, it will show a debugging banner in Chrome (which you can add an argument to Chrome to hide), and it is significantly more complicated to setup than other APIs.
For examples on how to use the debugger API, take a look at the chrome-extensions-samples: https://github.com/GoogleChrome/chrome-extensions-samples/tree/main/mv2-archive/api/debugger/live-headers
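With the Fetch domain, the overriding step comes down to sending `Fetch.fulfillRequest` with the paused request's id, a status code, a list of headers and a base64-encoded body. Here is a minimal sketch of building that parameter object; the helper name `buildFulfillParams` and the sample values are hypothetical, and Node's `Buffer` is used only so that the snippet runs standalone.

```javascript
// Hypothetical helper: build the parameter object for Fetch.fulfillRequest.
function buildFulfillParams(requestId, responseCode, headers, bodyText) {
  return {
    requestId,
    responseCode,
    // The protocol expects headers as a list of { name, value } entries.
    responseHeaders: Object.entries(headers).map(([name, value]) => ({ name, value: String(value) })),
    // The body must be base64-encoded.
    body: Buffer.from(bodyText, 'utf8').toString('base64'),
  };
}

// Made-up request id and payload; the real id comes from a Fetch.requestPaused event.
const fulfill = buildFulfillParams('interception-1', 200, { 'Content-Type': 'application/json' }, '{"patched":true}');
console.log(fulfill.responseHeaders, fulfill.body);
```

In an extension, the resulting object would be passed as the command parameters to `chrome.debugger.sendCommand(debuggee, "Fetch.fulfillRequest", ...)` after `Fetch.enable` has been sent.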
I've just found this extension and it does a lot of other things but modifying api responses in the browser works really well: https://requestly.io/
Follow these steps to get it working:
Install the extension
Go to HttpRules
Add a new rule and add a url and a response
Enable the rule with the radio button
Go to Chrome and you should see the response is modified
You can have multiple rules with different responses and enable/disable as required. I've not found out how you can have a different response per request though if the url is the same unfortunately.
