Is there a way to request an internal API of a public website from Node fetch? - node.js

I am trying to scrape dynamic websites and was using Puppeteer with Node.js before I realized I can just fetch the website's API directly and skip rendering things I don't need. By looking in the "Network" tab of Chrome's developer tools I can find the exact endpoints that return the data I need. This works for most of the sites I am trying to scrape, but for some, especially POST requests, the API returns a "403: Forbidden" error code.
The API returns a success response if I make a fetch request directly from the Chrome console. But as soon as I try from a different tab, from Postman, or from Node using node-fetch, I get "403: Forbidden".
I have tried copying the exact headers that the website sends naturally, and I have tried explicitly setting the "origin" and "referer" headers to the website's address, but to no avail.
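Roughly, the failing request looks like this (a sketch; the URL, cookie, and body are placeholders standing in for the values I copied from the Network tab):

```js
const fetch = require('node-fetch'); // node-fetch v2 (CommonJS)

(async () => {
  const res = await fetch('https://example.com/internal/api/search', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'origin': 'https://example.com',
      'referer': 'https://example.com/search',
      'user-agent': 'Mozilla/5.0 ...',                  // same UA string as my browser
      'cookie': 'session=<copied from the Network tab>',
    },
    body: JSON.stringify({ query: 'test' }),            // same payload the site sends
  });
  console.log(res.status); // 200 from the Chrome console, 403 from Node
})();
```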
Is this simply a security measure that is impossible to breach or is there a way to trick the API into thinking that the request is coming from their own website?

Related

Cannot load webpage from Postman because of javax.faces.ViewState?

I am trying to integrate a web application written by someone else with an API also written by someone else. At the moment I am trying to test one of the webpages using Postman. When the webpage is loaded in a browser it works correctly. I have replicated all of the headers and body in Postman, but when I try to launch the webpage in Postman an HTTP 500 status code is returned (internal server error).
I think the issue is with javax.faces.ViewState, which is a body key/value pair. I initially do a GET request to the webpage in Postman and read the ViewState from the response.
I tried passing the value xxxxxxxxxxxxxxxxxxxxxx;yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy in the body key/value pair, but I still get an internal server error. I have also checked that the JSESSIONID cookie is identical in the GET request and the POST request.
I have also noticed that if I access the webpage from a browser, there is a colon instead of a semicolon in the value, if that has any bearing.
Most of what I have tried so far was suggested in the answer to this question: How to programmatically send POST request to JSF page without using HTML form?
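To make the flow concrete, this is roughly what I am trying to reproduce outside the browser (a Node sketch for illustration; the URL and form-field names are placeholders, and only javax.faces.ViewState and JSESSIONID come from the real page):

```js
const fetch = require('node-fetch');

(async () => {
  // 1. GET the page to obtain the JSESSIONID cookie and the current ViewState
  const getRes = await fetch('https://example.com/app/page.xhtml');
  const cookie = getRes.headers.get('set-cookie');             // contains JSESSIONID
  const html = await getRes.text();
  const viewState =
    /name="javax\.faces\.ViewState"[^>]*value="([^"]+)"/.exec(html)[1];

  // 2. POST the form back with the same session cookie and the extracted ViewState
  const postRes = await fetch('https://example.com/app/page.xhtml', {
    method: 'POST',
    headers: {
      'content-type': 'application/x-www-form-urlencoded',
      cookie,
    },
    body: new URLSearchParams({
      'javax.faces.ViewState': viewState,
      'someForm:someField': 'value',                           // placeholder form fields
    }).toString(),
  });
  console.log(postRes.status);
})();
```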
What am I doing wrong?

HTTP GET request on website does not return full body (website scraping)

I am trying to scrape a website, but some pages of that website are not fully returned upon a GET request. I don't want to disclose the URL of said website, but I'd still like to ask for help in this regard.
I've implemented HTTP requests to log into the member area of that website, which works fine. Then I'd like to get a list of conversations; however, when I compare the responses (the same GET to the same location with the same parameters), I see the full HTML in the Firefox developer tools, but in my implementation (using the request Node.js module) I only see the inner <div id="content">...</div>, without any JavaScript or surrounding HTML.
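For reference, my request looks roughly like this (URL and header values are placeholders):

```js
const request = require('request');

// In my real code this jar is shared with the login requests, so the
// session cookies are sent along with the GET below.
const jar = request.jar();

request({
  url: 'https://example.com/conversations',
  jar,
  headers: {
    'user-agent': 'Mozilla/5.0 ...',             // same UA as Firefox
    'accept': 'text/html,application/xhtml+xml',
    'accept-language': 'en-US,en;q=0.9',
  },
}, (err, res, body) => {
  if (err) throw err;
  console.log(res.statusCode, body.length);      // body is only the inner <div id="content">
});
```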
How can this be? I understand JavaScript can inject HTML afterwards, but how should that be possible if no JavaScript has been received by my scraping implementation? What is different in Firefox? I understand that in Firefox their JavaScript client is probably running and doing the GET request, which then inserts content. However, the Firefox log shows an HTTP GET request (not XHR) and it shows the full response in the dev tools. How is this possible?
Does anyone have a hint on how to proceed further with this?

Get network data with nodejs

I am looking for a way to get complete data about the network activity of my own website with Node.js, exactly like in the Network tab of the Chrome DevTools.
But, surprisingly, I cannot find any information about it.
What I need is actually to get data from requests made by an iframe located on my webpage. But the tricky thing is that these specific requests don't pass through my server. Is there a way to access these requests as I would in the Network tab of Chrome?
The only way that I have found is to create a Chrome extension; it takes some effort, but then I can catch any request.
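A rough sketch of that extension approach, assuming a background script with the "webRequest" permission and host permissions such as "<all_urls>" (note that webRequest only exposes request metadata, not response bodies):

```js
// background.js – logs every request the browser makes, including requests
// issued from iframes that never pass through my own server.
chrome.webRequest.onCompleted.addListener(
  (details) => {
    console.log(details.method, details.url, details.statusCode);
  },
  { urls: ['<all_urls>'] }
);
```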

How do I override xhr-src?

I created a userscript for myself which is active on all webpages I visit. It sends data to my debugger/app via jQuery's post ($.post).
I noticed one site not allowing me to send data even though it worked before, and after a quick look it appears there is some kind of error via xhr-src. The response headers include an 'X-Content-Security-Policy' header which lists a bunch of sites (Google being one). So when I try to do a POST to localhost:myport/ it violates the policy and thus doesn't post.
What can I do to get this working again? I can't exactly edit the headers (unless I write my own HTTP proxy?). Would I be able to create an iframe using localhost:1234/workaround and post via that? But the issue is I still don't know if that's a violation, or how to pass it the data.
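One possible workaround, assuming the userscript runs under Greasemonkey or Tampermonkey: use GM_xmlhttpRequest instead of $.post, since that request is made by the userscript manager itself and is not subject to the page's Content Security Policy. A rough sketch (port and path are placeholders):

```js
// ==UserScript==
// @name     post-to-debugger
// @include  *
// @grant    GM_xmlhttpRequest
// @connect  localhost
// ==/UserScript==

// Sent by the userscript manager, so the page's X-Content-Security-Policy
// does not apply to this request.
GM_xmlhttpRequest({
  method: 'POST',
  url: 'http://localhost:1234/debug',
  headers: { 'Content-Type': 'application/json' },
  data: JSON.stringify({ page: location.href, payload: 'whatever I was posting' }),
  onload: (res) => console.log('debugger received it:', res.status),
});
```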

Why does new Facebook Javascript SDK not violate the "same origin policy"?

The new Facebook JavaScript SDK can let any website log in as a Facebook user and fetch data about that user...
So www.example.com will include some JavaScript from Facebook, but as I recall, that script is considered to have the origin of www.example.com and so cannot fetch data from facebook.com, because that would violate the "same origin policy". Isn't that correct? If so, how does the script fetch the data?
From here: https://developer.mozilla.org/en/Same_origin_policy_for_JavaScript
The same origin policy prevents a document or script loaded from one origin from getting or setting properties of a document from another origin. This policy dates all the way back to Netscape Navigator 2.0.
and explained slightly differently here: http://docs.sun.com/source/816-6409-10/sec.htm
The same origin policy works as follows: when loading a document from one origin, a script loaded from a different origin cannot get or set specific properties of specific browser and HTML objects in a window or frame (see Table 14.2).
The Facebook script is not attempting to interact with scripts from your domain or read your DOM objects. It just does its own post to Facebook. It gets your site name not by interacting with your page or with scripts from your site, but from the code itself, which is generated when you fill out the form to get the "like" button. I registered a site named "http://www.bogussite.com" and got the code to put on my website. The first thing in this code was
<iframe src="http://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.bogussite.com&
so the script is clearly getting your site info from the hard-coded URL parameters in the link to the iframe.
Facebook is far from alone in having you use scripts hosted on their servers. There are plenty of other scripts that work this way. All of the Google APIs, for example, including Google Gears, Google Analytics, etc., require you to use a script hosted on their server. Just last week, while I was trying to figure out how to do geolocation for our store finder in a mobile-friendly web app, I found a whole slew of geolocation services that have you use scripts hosted on their servers, rather than copying the script to your server.
I think, but am not sure, that they use the iframe method. At least the cross-domain receiver and XFBML stuff for canvas apps uses that. Basically, the JavaScript on your page creates an iframe within the facebook.com domain. That iframe then has permission to do whatever it needs with Facebook. Communication back with the parent can be done with one of several methods, for example the URL hash. But I'm not sure which method, if any, they use for that part.
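A much-simplified sketch of the receiver side of that URL-hash technique (illustrative only; the file name and the onFacebookMessage callback are hypothetical, not the SDK's real API). The receiver page is served from your own domain and embedded inside the facebook.com iframe; a cross-origin frame is allowed to navigate it and change its #hash, and it polls for that change:

```js
// xd_receiver.htm (hypothetical) – same origin as the top window
let lastHash = '';
setInterval(() => {
  if (location.hash !== lastHash) {
    lastHash = location.hash;
    const payload = decodeURIComponent(lastHash.slice(1));
    // Same origin as the top page, so we can hand the data straight up:
    parent.parent.onFacebookMessage(payload); // hypothetical callback on the top page
  }
}, 100);
```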
If I recall, they use script tag insertion. So when a JS SDK call needs to call out to Facebook, it inserts a <script src="http://graph.facebook.com/whatever?params...&callback=some_function"> tag into the current document. Then Facebook returns the data in JSON format as some_function({...}), where the actual data is inside the braces. This results in the function some_function being called in the origin of example.com, using data from graph.facebook.com.
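A minimal sketch of that script-tag (JSONP) pattern; the endpoint and parameters are illustrative, not Facebook's actual API:

```js
// Callback that the cross-origin response will invoke, defined in
// example.com's origin:
function some_function(data) {
  console.log('got', data);
}

// Inserting the script tag triggers a cross-origin GET that is not blocked
// by the same origin policy:
const s = document.createElement('script');
s.src = 'https://graph.facebook.com/whatever?params=1&callback=some_function';
document.head.appendChild(s);

// The server replies with:  some_function({...});
// which the browser executes as ordinary script, calling the function above
// with the data.
```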
