Modifying content delivered by node-http-proxy

Modifying content delivered by node-http-proxy - node.js

Due to some limitations about the web services I am proxying, I have to inject some JS code so that it allows the iframe to access the parent window and perform some actions.
I have built a proxy system with node-http-proxy which works pretty nicely. However I have spent unmeasurable hours trying to modify the content (on my own, using harmon as well, etc) that is being sent to the user without any success. I have found some articles and even some questions here but all of them are outdated and are not useful anymore.
I was wondering if someone can give me an actual example about how to do this, because I am unable to do it and maybe it is just that it is impossible to do at this point?

I haven't tried harmon, but I did try cheerio and it works.
However, I used http-mitm-proxy and not node-http-proxy.
If you are using http-mitm-proxy, you need to return a promise in the response handler. Otherwise, the proxy continues to send the original response without picking up your changes.
I have recently written another proxy at:
https://github.com/noeltimothy/noelsproxy
I'm going to add response handling to this soon. This one uses a callback mechanism, which means it wont return the response until the caller signals it to.
You should be able to use 'cheerio' and alter the content in JQuery style.

Related

Tracking Discord with GA4

Bots are amazing, unless you're Google Analytics
After many months of learning to host my own Discord bot, I finally figured it out! I now have a node server running on my localhost that sends and receives data from my Discord server; it works great. I can do all kinds of the things I want to with my Discord bot.
Given that I work with analytics everyday, one project I want to figure out is how to send data to Google Analytics (specifically GA4) from this node server.
NOTE: I have had success in sending data to my Universal Analytics property. However, as awesome as that was to finally see pageviews coming into, it was equally heartbreaking to recall that Google will be getting rid of Universal Analytics in July of this year.
I have tried the following options:
GET/POST requests to the collect endpoint
This option presented itself as impossible from the get-go. In order to send a request to the collection endpoint, a client_id must be sent along with the request itself. And this client_id is something that must be generated using Google's client id algorithm. So, I can't just make one up.
If you consider this option possible, please let me know why.
Install googleapis npm package
At first, I thought I could just install the googleapis package and be ready to go, but that idea fell on its face immediately too. With this package, I can't send data to GA, I can only read with it.
Find and install a GTM npm package
There are GTM npm packages out there, but I quickly found out that they all require there to be a window object, which is something my node server would not have because it isn't a browser.
How I did this for Universal Analytics
My biggest goal is to do this without using Python, Java, C++ or any other low level languages. Because, that route would require me to learn new languages. Surely it's possible with NodeJS alone... no?
I eventually stumbled upon the idea of actually hosting a webpage as some sort of pseudo-proxy that would send data from the page to GA when accessed by something like a page scraper. It was simple. I created an HTML file that has Google Tag Manager installed on it, and all I had to do was use the puppeteer npm package.
It isn't perfect, but it works and I can use Google Tag Manager to handle and manipulate input, which is wonderful.
Unfortunately, this same method will not work for GA4 because GA4 automatically excludes all identified bot traffic automatically, and there is no way to turn that setting off. It is a very useful feature for GA4, giving it quite a bit more integrity than UA, and I'm not trying to get around that fact, but it is now the Bane of my entire goal.
https://support.google.com/analytics/answer/9888366?hl=en
Where to go from here?
I'm nearly at the end of my wits on figuring this one out. So, either an npm package exists out there that I haven't found yet, or this is a futile project.
Does anyone have any experience in sending data from NodeJS to GA4? (or even GTM?) How did you do it?

...and this client_id is something that must be generated using Google's client id algorithm. So, I can't just make one up...
Why, of course you can. GA4 generates it pretty much the same as UA does. You don't need anything from google to do it.
Besides, instead of mimicking just requests to the collect endpoint, you may just wanna go the MP route right away: https://developers.google.com/analytics/devguides/collection/protocol/ga4 The links #dockeryZ gave, work perfectly fine. Maybe try opening them in incognito, or in a different browser? Maybe you have a plugin blocking analytics urls.
Moreover, you don't really need to reinvent the bicycle. Node already has a few packages to send events to GA4, here's one looking good: https://www.npmjs.com/package/ga4-mp?activeTab=readme
Or you can just use gtag directly to send events. I see a lot of people doing it even on the front-end: https://www.npmjs.com/package/ga-gtag Gtag has a whole api not described in there. Here's more on gtag: https://developers.google.com/tag-platform/gtagjs/reference Note how the library allows you to set the client id there.
The only caveat there is that you'll have to track client ids and session ids manually. Shouldn't be too bad though. Oh, and you will have to redefine the concept of a pageview, I guess. Well, the obvious one is whenever people post in the chan that is different from the previous post in a session. Still, this will have to be defined in the code.
Don't worry about google's bot traffic detection. It's really primitive. Just make sure your useragent doesn't scream "bot" in it. Make something better up.

Express response interception. Use body to modify the outgoing headers

I would like to try and improve site render times by making use of preload/push headers.
We have various assets which are required up front that I would like to preload, and various assets which are marked up in data attributes etc which will be required later via JS but not for initial paint. It would be good to get these flowing to the client early.
Our application is a bit of a hybrid, it uses http-proxy-middleware connected to various different applications, plus directly renders pages it self. I would like the middleware to be agnostic and work regardless of how to page is produced.
I've seen express-mung but this doesn't hold back the header so executes too late, and works with chunked buffers anyway not the entire response. Next up was express-interceptor, that works perfectly for pages rendered directly in express but causes request failures for pages run through the proxy. My next best idea is pulling apart the compression module to figure out how it works.
Does anyone have a better suggestion, or even better know of a working module for this kind of thing?
Thanks.

How HTTP response is generated

I'm fairly new to programming and this question is about making sure I get the HTTP protocol correctly. My issue is that when I read about HTTP request/response, it looks like it needs to be in a very specific format with a status code, HTTP version number, headers, a blank line followed by the body.
However, after creating a web app with nodejs/express, I never once had to actually write code that made an HTTP response in this format (I'm assuming, although I don't know for sure that other frameworks like ruby on rails or python/Django are the same). In the express app, I just set up the route handlers to render the appropriate pages, when a request was made to that route.
Is this because express is actually putting the response in the correct HTTP format behind the scenes? In other words, if I looked at the expressJS code, would there be something in that code that actually makes an HTTP response in the HTTP format?
My confusion is that, it seems like the HTTP request/response format is so important but somehow I never had to write any code dealing with it for a node/express application. Maybe this is the entire point of a framework like express... to take out the details so that developers can deal with business logic. And if that is correct, does anyone ever write web apps without a framework to do this. Would you then be responsible for writing code that puts the server's response into the exact HTTP format?

I'm fairly new to programming and this question is about making sure I get the HTTP protocol correctly. My issue is that when I read about HTTP request/response, it looks like it needs to be in a very specific format with a status code, HTTP version number, headers, a blank line followed by the body.
Just to give you an idea, there are probably hundreds of specifications that have something to do with the HTTP protocol. They deal with not only the protocol itself, but also with the data format/encoding for everything you send including headers and all the various content types you can send, authentication schemes, caching, status codes, URL decoding, etc.... You can see some of the specifications involved just by looking here: https://www.w3.org/Protocols/.
Now a simple request and a simple text response could get away with only knowing a few of these specifications, but life is not always that simple.
Is this because express is actually putting the response in the correct HTTP format behind the scenes? In other words, if I looked at the expressJS code, would there be something in that code that actually makes an HTTP response in the HTTP format?
Yes, there would. A combination of Express and the HTTP library that is built into node.js handle all the details of the specification for you. That's the advantage of using a library/framework. They even handle different versions of the protocol and feedback from thousands of other developers have helped them to clean up edge case bugs. A good library/framework allows you to still control any detail about the response (headers, content types, status codes, etc..) without making you have to go through the detail work of actually creating the exact response. This is a good thing. It lets you write code faster and lets you ride on the shoulders of others who have already figured out minutiae details that have nothing to do with the logic of your app.
In fact, one could say the same about the TCP protocol below the HTTP protocol. No regular app developer wants to write their own TCP stack. Instead, you just want a working TCP stack that you can use that's already been tuned and debugged for you.
However, after creating a web app with nodejs/express, I never once had to actually write code that made an HTTP response in this format (I'm assuming, although I don't know for sure that other frameworks like ruby on rails or python/Django are the same). In the express app, I just set up the route handlers to render the appropriate pages, when a request was made to that route.
Yes, this is a good thing. The framework did the detail work for you. You just call res.setHeader(), res.status(), res.cookie(), res.send(), res.json(), etc... and Express makes the entire response for you.
And if that is correct, does anyone ever write web apps without a framework to do this. Would you then be responsible for writing code that puts the server's response into the exact HTTP format?
If you didn't use a framework or library of any kind and were programming at the raw TCP level, then yes you would be responsible for all the details of the HTTP protocol. But, hardly anybody other than library developers ever does this because frankly it's just a waste of time. Every single platform has at least one open source library that does this already and even if you were working on a brand new platform, you could go get an open source body of code and port it to your platform much quicker than you could write all this yourself.
Keep in mind that one of the HUGE advantages of node.js is that there's an enormous body of open source code (mostly in NPM and Github) already prepackaged to work with node.js. And, because node.js is server-side where code memory isn't usually tight and where code just comes from the local hard disk at server init time, there's little downside to grabbing a working and tested package that does what you already need, even if you're only going to use 5% of the functionality in the package. Or, worst case, clone an existing repository and modify it to perfectly suit your needs.

Is this because express is actually putting the response in the
correct HTTP format behind the scenes?
Yes, exactly, HTTP is so ubiquitous that almost all programming languages / frameworks handle the actual writing and parsing of HTTP behind the scenes.
Does anyone ever write web apps without a framework to do this. Would
you then be responsible for writing code that puts the server's
response into the exact HTTP format?
Never (unless you're writing code that needs very low level tweaking of HTTP code or something)

Server side response to allow client side routing

I am developing a single page application that has a client side router. so although the base url to run the application will be either http:://example.com or http:://example.com/index.html - skipping the domain name that is routes '/' and '/index.html'
But somewhere in my application, because of my client side router, I may call up a route something like '/appointments/20160113 and the client router will redirect me to the appropriate "Appointments Page" inside my SPA passing the parameter of todays date.
But if the user calls directly http://example.com/appointments/20160113, I am led to believe that the server should respond directly with /index.html so the browser doesn't get a 404.
I am building the server using nodejs (specifically the http2 module, but I don't think that is very important, and my examples don't use https, although with http2 they do). I just tried changing the server so if its hit with an unknown url it responds with the index.html file.
However, the browser sees this as its first response and then makes requests for the rest of its attached files relative to the url (so for instance follows up with /appointments/20160113/styles/main.css). This leads to an infinite loop, as the server responds with another copy in index.html (and immediately gets a request back for /appointments/20160113/styles/styles/main.css ).
Its too early in the lifecycle of the page for the javascript to be running yet (and specifically the router software), so clearly the approach is too simplistic.
I have also written the client side router (using the history api) so I can change things if I need to.
How should this situation be handled. I am thinking perhaps a 301 redirect to /index.html or something and then the router's initial dispatch knows this and can do a popstate or something. I ideally want to support the passing of urls via external means between users, but until I actually tried to implement it I hadn't realise the implications.

I don't know if this is the best way or not, but having not received any answers on here, I decided to try a number of different ways and see which worked out the best.
Each way involved doing a 301 redirect to /index.html, and then providing the url from which I was redirecting via different mechanisms
This is what I tried
Setting a cookie with a short expiry date the value of which was the url
Adding a query string with a ?redirect= parameter with the url
Adding a #fragment after /index.html with the url
In the end I rejected 1) because chrome wasn't deleting the cookie after I had used it and making the value shorted lived depends on accurate timing between client and server. The solution appeared too fragile.
I tried 2) and it was looking good until I came to test it. Unfortunately setting window.location.search causes a page reload, and I was really struggling with finding out what was happening. However, what I discovered in 3) about mocking could well be provided to a solution based on 2) so it is one that could be used. I still might return to this solution as it "feels" right to me.
I tried 3) and it worked quite well. However I was struggling with timing issues in testing since my router element was using the #fragment during initialisation, but I couldn't set the window.location.hash until after the router was established in the test suite. I wanted to mock window.location.hash with sinon so I could control it, but it turns out you can't
The solution to this was for the router to wrap its own calls to window.location.hash in a library, so that I could mock the library. And that is what I did in the end and it all worked.
I could go back to using a query string and wrapping window.location.search in a library call, so I could stub that call and avoid the problems of page reloading.

How do I override xhr-src?

I created a userscript for myself which is active on all webpages i visit. It sends data to my debugger/app via jquery's post ($.post).
I notice one site not allowing me to send data even though it worked before and after a quick look it appears there is some kind of error via xhr-src. It appears the response headers has a 'X-Content-Security-Policy' which list a bunch of sites (google being one). So when i try to do a post to localhost:myport/ it violates the rule thus doesn't post.
What can I do to get this working again? I can't exactly edit the headers (unless i write my own http proxy?) would i be able to create an iframe using localhost:1234/workaround and post via that? But the issue is i still dont know if thats a violation or how to give it data.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string