Getting information from http response [closed] - node.js

This is my first time dealing with web-dev stuff, so this may be a stupid question. I am using axios to send an HTTP request to a certain website, which returns a lot of information. How can I go about extracting the information inside? For example, I am trying to extract the score from the first result on this page (https://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&queryoption=HEADER&facetSearch=true&query=anne+baranger&schoolName=university+of+california+berkeley&dept=chemistry).
From inspecting the request in Postman, I saw that what I need is in a script element of the HTML page. How can I parse this information? Thanks.

Using Axios
To make a GET request in Axios, you would use:
const axios = require('axios');

axios.get('https://domain.tld/path').then(response => {
  // `status` holds the HTTP status code; 200 means OK
  if (response.status === 200) {
    const { data } = response;
    // Use `data` here
  }
});
If your request accessed a JSON API, data would be a regular JavaScript object.
In your specific case, however, the GET request does not access a JSON API: it goes to a webpage, which responds with a string of HTML code, so data is that string.
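Since data is just an HTML string here, and the question mentions that the value sits in a script element, one option is to pull the embedded JSON out as text. This is only a minimal sketch, assuming the page assigns a JSON object to a global inside an inline script; the name __DATA__, the avgRating field, and the sample HTML are hypothetical and not taken from the real page:

```javascript
// Sketch: extract a JSON object assigned as `window.<name> = {...};` in an HTML string.
// Assumes `globalName` is a plain identifier (letters, digits, underscores), that the
// assigned value is valid JSON, and that the JSON contains no "};" inside a string.
function extractEmbeddedJson(html, globalName) {
  const pattern = new RegExp('window\\.' + globalName + '\\s*=\\s*(\\{[\\s\\S]*?\\});');
  const match = html.match(pattern);
  return match ? JSON.parse(match[1]) : null;
}

// Hypothetical stand-in for the HTML string that axios returns in `data`.
const sampleHtml =
  '<html><body><script>window.__DATA__ = {"teachers":[{"name":"A. Example","avgRating":4.5}]};</script></body></html>';

const parsed = extractEmbeddedJson(sampleHtml, '__DATA__');
console.log(parsed.teachers[0].avgRating); // 4.5
```

If the script element on the real page holds the data in some other shape, the pattern would need to change accordingly; a proper HTML parser is the sturdier choice once the structure is known.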
Your Problem
You are trying to scrape data from a page whose HTML response does not contain the data you want. Of course, when you access the site yourself, you see that the data is right there. But right as the page loads, you will see "Loading...". This is because the site makes a JavaScript call to its internal API to fetch the search results.
When you make an HTTP request you only get the data that is sent back. No scripts are imported to your own site, no JavaScript is executed, and stylesheets do not render. You are simply stuck with the HTML string. Any data not enclosed in the HTML string will be out of your reach.
A Potential Solution
What you are working towards is called web scraping. For a page like this one, that means loading the webpage, letting the scripts that run during load time execute, waiting for the page to finish loading, and then collecting the visible data from the page.
To scrape such a page, you will need to write a web scraper that runs on a server, and it will need a headless browser. A popular tool for driving headless Chrome from Node.js is Puppeteer. An alternative is Selenium.
A headless browser is simply Google Chrome, Safari, or Firefox, but without the window. Usually you run your browser in a window, but on your server everything is automated, so no window needs to be opened; only the JS and DOM need to run. This allows scripts to run, stylesheets to be applied, and all content to load (since this is an actual browser running the site).
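As a rough sketch of what that looks like with Puppeteer: the function below takes the puppeteer module as a parameter so the snippet has no hard dependency, and the '.rating' selector in the usage comment is a hypothetical placeholder, not something taken from the real page.

```javascript
// Sketch: load a page in a headless browser and read text that the page's
// own scripts render after load. `launcher` is expected to be the `puppeteer`
// module (npm install puppeteer).
async function scrapeRenderedText(launcher, url, selector) {
  const browser = await launcher.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so the page's API calls can finish.
    await page.goto(url, { waitUntil: 'networkidle0' });
    await page.waitForSelector(selector);
    // The callback runs inside the page and returns the element's text.
    return await page.$eval(selector, (el) => el.textContent.trim());
  } finally {
    // Always close the browser, even if a step above throws.
    await browser.close();
  }
}

// Usage (assumes puppeteer is installed; '.rating' is a hypothetical selector):
// const puppeteer = require('puppeteer');
// scrapeRenderedText(puppeteer, 'https://domain.tld/path', '.rating').then(console.log);
```

Which selector to wait for depends on the page; inspecting the loaded page in the browser's dev tools is the usual way to find it.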
Web Scraping Solution
If you have never written a web scraper before, you can use this GitHub repository (JustData) to learn about it.
You can check out a live example of the data JustData scrapes here.
In the GitHub repository, navigate to server/src/Scraper.

Related

What is the best method for Backend and Frontend communication (incl. SessionID)? [closed]

I have a small project running and I am wondering how to approach this. I host on two servers: one for the backend and one for the frontend. The backend server runs Node.js with MongoDB as the database, and I already have my own small API to communicate with it. On the frontend server I am using React. My question now: what is the best way to create a session ID on the frontend and send it to the backend server via the frontend server?
For example, SessionID3 should be shown a bike on their site and SessionID2 a car (both pieces of information are stored in the database on the backend server).
I look forward to any replies, thanks!
Most Jamstack setups have an API server that serves content as JSON or raw text (like GitHub's developer API). Both Node.js and browser JavaScript have a built-in feature, JSON.parse, that quickly converts such a response into a JavaScript object; you can read about it on MDN.
Diagram of Jamstacking with React & Vue
As the diagram shows, it is actually the client that sends the request and does most of the work, not the frontend server. So you can use universal-cookie to set or get a persistent session ID, serialize it into JSON, and send it with the POST request to the backend API. You could also build the ID from Math.random() if you just want a random string of numbers to be sent; note that Math.random() takes no arguments and returns a number between 0 and 1.
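Here is a small sketch of building that POST body. It swaps in crypto.randomUUID for Math.random, since Math.random is not designed to produce unique or unguessable values. The sessionId field name is a hypothetical choice, and reading or writing the cookie itself is left out:

```javascript
// Node.js import; in a browser, the global crypto.randomUUID() does the same job.
const { randomUUID } = require('crypto');

// Build the JSON body for the POST request to the backend API.
// `existingId` would come from the cookie, if one was already stored.
function buildSessionPayload(existingId) {
  // Reuse the stored ID when there is one; otherwise create a new one.
  const sessionId = existingId || randomUUID();
  return JSON.stringify({ sessionId });
}

console.log(buildSessionPayload('abc-123')); // {"sessionId":"abc-123"}
console.log(buildSessionPayload()); // a fresh random UUID on each run
```

A newly generated ID would then be written back to the cookie so the same visitor keeps the same ID on later requests.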
Axios is asynchronous, so you need to use async/await or .then() with it to get the result. You can view the official docs at https://axios-http.com/docs/intro. The value Axios resolves with is a response object; the parsed body is in its data property, alongside status and headers.
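With async/await, that could look roughly like the sketch below. The axios module is passed in as a parameter so the snippet stands alone; the /content path and the sessionId field are hypothetical, not part of any real API:

```javascript
// Sketch: send the session ID to the backend and return the parsed body.
// `http` is expected to be the axios module or an axios instance.
async function fetchSessionContent(http, apiBase, sessionId) {
  // axios serializes the object as JSON for the request body.
  const response = await http.post(`${apiBase}/content`, { sessionId });
  // The body is on `response.data`, already parsed when the server sent JSON.
  return response.data;
}

// Usage (assumes axios is installed and the backend exposes this route):
// const axios = require('axios');
// fetchSessionContent(axios, 'https://api.domain.tld', 'SessionID3').then(console.log);
```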
Another alternative for getting a browser session ID (which is erased after the browser is closed) is a module called react-session-hook.
In my opinion, for what you are trying to do, adding another server that relays the message is not optimal, since it adds another sequential connection and can make things take roughly twice as long as they otherwise would. In other words, it is not necessary. The only time you would do this is when the keys used to access the API are sensitive and must be hidden.

HTTP GET request on website does not return full body (website scraping)

I am trying to scrape a website, but some pages of that website will not be fully returned upon a GET request. I don't want to disclose the URL of said website, but I'd still like to ask for help in this regard.
I've implemented HTTP requests to log into the member area of that website, which works fine. Next, I'd like to get a list of conversations. However, when I compare against the response in the Firefox developer tools (for the same GET to the same location with the same parameters), I see the full HTML in the Firefox dev tools, but in my implementation (using the request Node.js module) I only see the inner <div id="content">...</div>, without any JavaScript or surrounding HTML.
How can this be? I understand JavaScript can inject HTML afterwards, but how would this be possible if no JavaScript has been received by my scraping implementation? What is different in Firefox? I understand that in Firefox their JavaScript client is probably running and doing the GET request, which then inserts content. However, the Firefox log shows an HTTP GET request (no XHR), and it shows the full response in the dev tools. How is this possible?
Anyone got a hint on how to proceed on this further?

How web browsers decide which resource should be requested

I have a fundamental question that I have been researching for a long time, but I still don't know the exact answer.
I am working with browsers and web applications. I am wondering how, and based on what, a web browser decides to send a particular request to the web server.
For example, when you enter http://www.google.com in the address bar of your web browser, the browser will send a bunch of requests to the web server in order to render the web page properly.
Now, my question is how the web browser decides which requests it needs to send to the web server.
Is it related to tags like 'link' or 'script' inside the body of the responses?
Does the browser parse the JavaScript functions to see if it should send a request based on those functions?
Let's take an example to explain this.
Consider you want to search for something and you hit http://www.google.com on your browser. These are the events that unfold to fetch you the page that will let you type in your query.
First, the networking stack on your machine will try to figure out which actual internet address matches www.google.com. This is called a DNS lookup. Once it receives a response for this lookup in the form of an IP address, it can make a connection to the actual server that is serving google.com.
The machine makes a socket connection and uses the HTTP protocol to communicate with the server. It queries for the resource at / (which is the root) of the address you are trying to reach. This is called a GET request. The request is normally described like so: GET /
Google will respond with an HTML page, normally "index.html", which gets downloaded by the browser.
Once the HTML is downloaded, all linked resources, such as the images needed to render the HTML and the JavaScript referenced by the HTML page, get downloaded.
The downloaded HTML page is parsed and an in-memory tree is created called the "DOM Tree". This tree contains the elements of the HTML page in a hierarchy. Once the DOM is created, you can see the page being rendered on the browser.
During this parsing, the browser discovers more resources to be downloaded, such as images, stylesheets, javascript files. The HTML page references these resources via different tags such as <img> for images, <script> for javascript.
All detected resources are downloaded. Browsers download many of these resources in parallel, but apply them (JavaScript and stylesheets) sequentially, in the order they were found on the page.
Stylesheets are parsed, and the styles are applied to the DOM of the HTML page. Sometimes, if stylesheets take longer to download, you can see the "raw" HTML page being rendered before the styles are applied. This happens sometimes over a slow connection.
Once the HTML page and related javascript files have been downloaded, the browser calls the "onload" callback function of javascript. Most Javascript heavy applications are started at this time.
Once onload is called, Javascript takes over and can attach handlers for different elements on the web page. Once the handlers have all been installed, interacting with the webpage could call one or more javascript functions that are listening for these events.
Javascript can also manipulate the DOM (the elements on the page), which results in UI updates (what the user sees) and therefore can be used to build a complete app on a single page.
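To make the discovery step concrete, here is a minimal sketch that scans an HTML string for the tags mentioned above and lists the URLs a browser would go on to request. A regex scan is a simplification (real browsers use a full HTML parser and also follow requests made by scripts), and the sample page and example.com URLs are made up for illustration:

```javascript
// Sketch: list sub-resources referenced by <img>, <script> and <link> tags.
// Simplified: every src/href on these tags is treated as a resource to fetch.
function discoverResources(html, pageUrl) {
  const found = [];
  const tagPattern = /<(img|script|link)\b[^>]*?\b(?:src|href)\s*=\s*["']([^"']+)["']/gi;
  for (const match of html.matchAll(tagPattern)) {
    // Resolve relative references against the page URL, as a browser does.
    found.push({ tag: match[1].toLowerCase(), url: new URL(match[2], pageUrl).href });
  }
  return found;
}

// Hypothetical page, standing in for the HTML returned by the first GET request.
const samplePage = `
<html><head>
  <link rel="stylesheet" href="/css/site.css">
  <script src="js/app.js"></script>
</head><body><img src="https://cdn.example.com/logo.png"></body></html>`;

console.log(discoverResources(samplePage, 'https://www.example.com/index.html'));
```

Each entry in the result corresponds to one additional request you would see in the network tab after the initial page request.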
Here is some more reading on the process: http://friendlybit.com/css/rendering-a-web-page-step-by-step/
The best way to examine this interaction is to use the developer tools in Chrome, Firefox, or IE and view the network activity when you visit a web page.

How to perform a POST through a Chrome extension?

How can I perform a POST from a Chrome extension? Let's say I want to send the current tab's page title to a webpage.
You can do POST XHRs from chrome extensions to any URL, as long as you have host permissions defined in your manifest. See these docs.
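A rough sketch of what that could look like is below. It uses fetch instead of a raw XHR, the endpoint URL is hypothetical, and it assumes the manifest grants host permission for that URL plus the permission needed to read tab titles:

```javascript
// Hypothetical endpoint; it must be covered by host permissions in the manifest.
const ENDPOINT = 'https://domain.tld/titles';

// Build the fetch options separately so they can be checked without a browser.
function buildTitleRequest(title) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ title }),
  };
}

// This part only runs inside an extension, where the chrome.tabs API exists.
if (typeof chrome !== 'undefined' && chrome.tabs) {
  chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
    if (tabs.length > 0) {
      // Reading `title` requires the "tabs" or "activeTab" permission.
      fetch(ENDPOINT, buildTitleRequest(tabs[0].title));
    }
  });
}
```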
In a Chrome extension, the best way to do what I think you want is via a content script (see the documentation). A word of warning, however: pinging your server with a POST request every time someone with your extension installed opens a web page is going to be extremely heavy on your servers, especially if you have a lot of installs. A possible solution is to use the content script to keep a tally of the sites a user visits and save this data in an HTML5 database (which Chrome supports), then have background.html send the data in bulk at given intervals with an AJAX request. This will significantly cut down the number of times your server is pinged.
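The tally-and-send-in-bulk idea can be sketched as follows. Storage and the actual AJAX call are kept abstract here: the tally lives in memory rather than in a database, and postBatch in the closing comment is a hypothetical function that would perform the request.

```javascript
// Sketch: count page visits locally and hand them over in one batch.
function createVisitTally() {
  const counts = new Map();
  return {
    // Called once per visited page.
    record(url) {
      counts.set(url, (counts.get(url) || 0) + 1);
    },
    // Pass the accumulated counts to `send` once, then start over.
    flush(send) {
      if (counts.size === 0) return; // nothing to report, skip the request
      send(Object.fromEntries(counts));
      counts.clear();
    },
  };
}

const tally = createVisitTally();
tally.record('https://a.example/');
tally.record('https://a.example/');
tally.record('https://b.example/');
tally.flush((batch) => console.log(JSON.stringify(batch)));

// In the background page this could run on a timer, for example:
// setInterval(() => tally.flush(postBatch), 15 * 60 * 1000);
```

Three page views produce one request here instead of three, which is the whole point of batching.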

How can I pass a message from outside URL to my Chrome Extension?

I know there's a way for extensions and pages to communicate locally, but I need to send a message from an outside URL and have my Chrome extension listen for it.
I have tried easyXDM in the background page, but it seems to stop listening after a while, as if Google "turns off" the JavaScript in the background page after a while.
I think you could try a workaround: build a site with some specific data structure, then implement a content script that looks for that data structure. When it finds one, it can fetch the data you want passed to your extension.
Yes, you need a content script that communicates with the page using DOM events. Instructions on how to do that are here:
http://code.google.com/chrome/extensions/content_scripts.html#host-page-communication
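A minimal sketch of the content-script side, assuming the outside page calls window.postMessage with a small object. The 'FROM_PAGE' type is a hypothetical marker that the page and the extension would have to agree on:

```javascript
// The web page is assumed to send something like:
//   window.postMessage({ type: 'FROM_PAGE', text: 'hello' }, '*');

// Decide whether a message event is one the extension should accept.
function pickPageMessage(event, pageWindow) {
  // Ignore messages coming from other frames or windows.
  if (event.source !== pageWindow) return null;
  // Ignore messages that do not carry the agreed marker.
  if (!event.data || event.data.type !== 'FROM_PAGE') return null;
  return event.data;
}

// This part only runs in a content script, where window and chrome.runtime exist.
if (typeof window !== 'undefined' && typeof chrome !== 'undefined' && chrome.runtime) {
  window.addEventListener('message', (event) => {
    const message = pickPageMessage(event, window);
    // Forward accepted messages to the rest of the extension.
    if (message) chrome.runtime.sendMessage(message);
  });
}
```

Because any script on the page can post messages, the content script should treat the payload as untrusted input and validate it before acting on it.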
