How web browsers decide which resources to request

I have a fundamental question that I have been searching about for a long time, but I still don't know the exact answer.
I work with browsers and web applications. I am wondering how, and based on what, a web browser decides to send a particular request to the web server.
For example, when you enter http://www.google.com in the address bar of your web browser, the browser sends a bunch of requests to the web server in order to render the web page properly.
Now, my question is: how does the web browser decide which requests it needs to send to the web server?
Is it related to tags like <link> or <script> inside the body of the responses?
Does the browser parse the JavaScript to see whether it should send a request based on those functions?

Let's take an example to explain this. Consider you want to search for something and you enter http://www.google.com in your browser. These are the events that unfold to fetch the page that will let you type in your query.
First, the networking stack on your machine will try to figure out which actual internet address matches www.google.com. This is called a DNS lookup. Once it receives a response for this lookup in the form of an IP address, it can make a connection to the actual server that is serving google.com.
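As a rough illustration, this is the same lookup you can perform yourself from Node.js with its built-in dns module (a minimal sketch, not what the browser literally runs):

    // resolve a hostname to an IP address, as the browser's networking stack does
    const dns = require('dns');

    dns.lookup('www.google.com', (err, address) => {
      if (err) throw err;
      console.log(address); // the IP address the connection will be made to
    });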
The machine makes a socket connection and uses the HTTP protocol to communicate with the server. It queries for the resource at / (which is the root) of the address you are trying to reach. This is called a GET request. The request is normally described like so: GET /
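To make that exchange concrete, here is a minimal sketch of the raw conversation using Node's net module; a real browser adds many more headers, but the shape of the request is the same:

    // open a TCP socket and speak HTTP by hand
    const net = require('net');

    const socket = net.connect(80, 'www.google.com', () => {
      // a GET request for the root resource "/"
      socket.write('GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: close\r\n\r\n');
    });

    // prints the server's status line, headers, and body
    socket.on('data', (chunk) => process.stdout.write(chunk));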
Google will respond with an HTML page, traditionally named "index.html", which gets downloaded by the browser.
Once the HTML is downloaded, all linked resources, such as the images needed to render the page as well as any JavaScript referenced by the HTML, get downloaded too.
The downloaded HTML page is parsed and an in-memory tree called the "DOM tree" is created. This tree contains the elements of the HTML page in a hierarchy. Once the DOM is created, you can see the page being rendered in the browser.
During this parsing, the browser discovers more resources to be downloaded, such as images, stylesheets, and JavaScript files. The HTML page references these resources via different tags, such as <img> for images, <link> for stylesheets, and <script> for JavaScript.
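For instance, parsing a page like the following (the file names are hypothetical) would trigger three additional requests, one per referenced resource:

    <html>
      <head>
        <link rel="stylesheet" href="styles.css">   <!-- stylesheet request -->
        <script src="app.js"></script>              <!-- javascript request -->
      </head>
      <body>
        <img src="logo.png" alt="logo">             <!-- image request -->
      </body>
    </html>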
All detected resources are downloaded. Browsers download many of these resources in parallel, but apply them (JavaScript and stylesheets) sequentially in the order they were found on the page.
Stylesheets are parsed, and the styles are applied to the DOM of the HTML page. If stylesheets take longer to download, you can sometimes see the "raw" HTML page rendered before the styles are applied; this tends to happen over a slow connection.
Once the HTML page and related JavaScript files have been downloaded, the browser fires the "onload" event. Most JavaScript-heavy applications start up at this time.
Once onload fires, JavaScript takes over and can attach handlers to different elements on the web page. Once the handlers have been installed, interacting with the web page can call one or more JavaScript functions that are listening for these events.
JavaScript can also manipulate the DOM (the elements on the page), which results in UI updates (what the user sees) and can therefore be used to build a complete app on a single page.
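Put together, a minimal sketch of that load-then-interact pattern looks like this (the element ids are hypothetical):

    // wait for the page and its resources, then wire up the UI
    window.addEventListener('load', () => {
      document.getElementById('searchButton').addEventListener('click', () => {
        // manipulating the DOM updates what the user sees, without a page reload
        document.getElementById('results').textContent = 'Searching...';
      });
    });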
Here is some more reading on the process: http://friendlybit.com/css/rendering-a-web-page-step-by-step/
The best way to examine this interaction is to use the developer tools in Chrome, Firefox, or IE and view the network activity when you visit a web page.

Related

HTTP GET request on website does not return full body (website scraping)

I am trying to scrape a website, but some pages of that website will not be fully returned upon a GET request. I don't want to disclose the URL of said website, but I'd still like to ask for help in this regard.
I've implemented HTTP requests to log into the member area of that website, which works fine. Then I'd like to get a list of conversations. However, when I compare the response shown in the Firefox developer tools (for the same GET to the same location with the same parameters) with my implementation (using the request nodejs module), Firefox shows the full HTML, while my implementation only receives the inner <div id="content">...</div>, without any JavaScript or surrounding HTML.
How can this be? I understand JavaScript can inject HTML afterwards, but how would that be possible if no JavaScript has been received by my scraping implementation? What is different in Firefox? I understand that Firefox is probably running the site's JavaScript client, which performs the GET request and then inserts content. However, the Firefox log shows an HTTP GET request (not XHR) and it shows the full response in the dev tools. How is this possible?
Does anyone have a hint on how to proceed further?

Using NodeJS functions in html

So I have made a back end in NodeJS, but I ran into one problem: how is it possible to link my back end to my front-end HTML/CSS page and use my NodeJS functions as scripts?
In case this wasn't clear to you: your nodejs back end runs on your server. The server's job (in a web app) is to deliver data to the browser. It delivers HTML pages. It delivers resources referenced in those HTML pages, such as scripts, images, fonts, style sheets, etc. It can also answer programmatic requests for data.
The scripts in those web pages run inside the browser, which is nearly always (except for some developer testing scenarios) running on a completely different computer, connected to the server only via some network (usually the internet).
As such, a script in the browser cannot directly reference variables that exist on the server or call functions that exist on the server. They are completely different computers.
The kinds of things you can do to work around this architectural limitation are as follows:
The server can dynamically modify the web page it is sending to the browser, and it can insert data into that web page. That data can be in the form of already-rendered HTML, or it can be variables inside of script tags that your web page JavaScript can then use.
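For example, a minimal sketch of that first approach using Express (the route, variable name, and markup are all hypothetical):

    const express = require('express');
    const app = express();

    app.get('/', (req, res) => {
      const userName = 'Alice'; // data only the server knows
      // the value is embedded into the page before it is sent to the browser
      res.send(`<h1 id="greeting"></h1>
    <script>
      const userName = ${JSON.stringify(userName)};
      document.getElementById('greeting').textContent = 'Hello, ' + userName;
    </script>`);
    });

    app.listen(3000);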
The javascript in the web page can make network requests to your server asking it for data. These are often called AJAX calls. In this scenario, some Javascript in your page sends a request to the server to retrieve some data or cause some action on the server. The server receives that request, carries out the desired operation and then returns a result back to the client Javascript running in the browser. That client Javascript receives the result and can then act on it, inserting data into the page, making the browser go to a new web page, prompting the user, etc...
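A minimal sketch of that round trip (the endpoint path and payload are hypothetical): the server exposes a small JSON endpoint, and the page's JavaScript fetches from it and acts on the result:

    // server side (Express): answer a programmatic request for data
    app.get('/api/time', (req, res) => {
      res.json({ now: new Date().toISOString() });
    });

    // client side (in a <script> on the page): request the data, then use it
    fetch('/api/time')
      .then((response) => response.json())
      .then((data) => {
        document.getElementById('time').textContent = data.now;
      });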
There are some other ways the web page JavaScript can communicate with the server, such as WebSocket connections, but we'll put those aside for now as they are just more ways for remote communication to happen - the structure of the communication doesn't really change.
how is it possible to link my back-end to my front end html/css page and use my NodeJS functions as scripts?
You can't directly use your nodejs functions as scripts in the front end. You can make Ajax calls to the server and ask the server to execute its own server code on your behalf to carry out some operation or retrieve some data.
If appropriate, you can also insert scripts into the web page and run JavaScript directly in the browser, but whether you can do that for your particular situation depends entirely upon what the scripts are doing. If the scripts access some resource that is only available on the server (like a database or a server storage system), then you won't be able to run those types of scripts in the browser. You will have to use Ajax calls to ask the server to run them for you and then retrieve the results.

Is my picture of a website correct?

I tried analyzing what a website is, in essence, by deconstructing or reverse engineering one. I speculate that the following sequence of events takes place during interaction with a website.
1. Every website is basically a set of computer programs, which get executed when the system where they are stored is contacted.
2. Depending on the processing of the type of request sent by the sender, some XML files, files containing the code to be executed in response to different events, and some scripts intended for dynamic alteration of the XML files are sent.
3. Out of these XML files, one contains the information about the initial appearance of the page and the placement of the different controls or event generators on the screen.
4. When some activity is done in the vicinity of one event generator, like a mouse click, an event is generated.
5. The code snippet that responds to the event is executed. If that code involves contacting the server and sending some request, then the server is contacted again.
6. When the server is contacted again, it executes some code depending on the request sent and, in response, transfers some more code files, XML files, and scripts to dynamically change the appearance of the page.
Is my understanding of the flow of a website correct?
A web server is basically just a program sitting on a computer that listens on some TCP port (usually 80 for HTTP, 443 for HTTPS).
Clients (such as browsers) can connect and send a request (in HTTP format) to the server.
The server then sends an HTTP response back.
That's it. That's the basic flow: Connect, request, response.
The response contains a "type" field (the Content-Type header) that tells the client what to do with the data. E.g. it could send an image (which is usually displayed on screen), an audio file (which is played), or a "normal" web page in HTML format.
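A minimal sketch of such a server in Node.js, tying the flow together (the port and page content are arbitrary):

    const http = require('http');

    // a program listening on a TCP port, answering each request with a response
    const server = http.createServer((req, res) => {
      // the Content-Type header tells the client what kind of data follows
      res.writeHead(200, { 'Content-Type': 'text/html' });
      res.end('<html><body><h1>Hello</h1></body></html>');
    });

    server.listen(8080); // point a browser at http://localhost:8080/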
HTML contains structured information about page content and layout, and may contain references to other resources such as images, style sheets, and scripts. A browser automatically fetches these resources (another HTTP request/response) and processes them.
Scripts can be used to customize the behavior on the client side. These are typically written in JavaScript and make use of an API exposed by the browser for interacting with the current page. They can e.g. register "click" handlers to define what happens when the user clicks on some page element.
XML may or may not be used internally by the web server. It doesn't really matter as far as clients are concerned.
If you want to learn more about this, I suggest researching HTTP, HTML, CSS, and JavaScript. MDN has some good articles, for example.

Submit form via headless browser - NodeJS

Due to an archaic stack, the response from a form submission is HTML. When rendered on my client's domain, the relevant information is injected via a portlet. If you render the markup locally, the necessary content is missing. This makes it impossible for me to simply post data to the relevant form endpoint.
As a result of this, I need to submit a form and scrape the success/fail page for the necessary information in a headless browser.
I'm planning on wiring an API endpoint in my NodeJS application that I can post the form data to which in turn will submit the form in the headless browser and respond with the scraped content.
Are there any frameworks that would support this? I've looked at Nightwatch and WebDriver, but they all seem to be aimed at automated testing rather than what I'm after.
Try using casper.js
It is a scripting and testing utility for use with SlimerJS (a headless browser).
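A minimal sketch of such a casper.js script; the URL, form selector, field names, and result selector are all hypothetical:

    var casper = require('casper').create();

    casper.start('https://example.com/contact');

    casper.then(function () {
        // fill in the form; the final "true" submits it
        this.fill('form#contact-form', {
            name: 'Jane Doe',
            email: 'jane@example.com'
        }, true);
    });

    casper.then(function () {
        // scrape the success/fail page for the content you need
        this.echo(this.getHTML('#result'));
    });

    casper.run();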

Node/Express app: finding the onhashchange event handler

I'm working on a Node.js website, I've taken the work on for a charity and I confess I'm learning on the job.
The page in question starts with its content rendered but invisible. When you click a button that redirects to a URL starting with a # (which means it gets appended to the page's URL), no GET occurs, but the content is revealed. The issue is that the content needs to be filtered. However, I cannot figure out what is triggering this. The word 'hashchange' does not occur in the code base, and the window.onhashchange event is null. Where would I look to try to track down the code that is doing this?
The content after the hash mark is called a URL fragment. URL fragments are not sent to the server, and appending a URL fragment does not typically trigger a page fetch, so it makes sense that no GET occurs.
URL fragments are commonly used to keep track of navigation state on the browser side. This is common with single-page apps (SPAs), which fetch the entire page from the server only once and handle the rest of navigation using JavaScript, pushState, and AJAX queries.
This is presumably what is happening when you navigate to different tabs. The client-side JavaScript appends a URL fragment in order to push state onto the browser history without forcing an unnecessary page reload. Note that this code does not need to listen to the onhashchange event for this to work, which is why you don't see any mention of it in your code search.
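As an illustration, a click handler on a hash link can reveal content with no onhashchange listener at all (the element ids are hypothetical):

    // clicking the link updates location.hash on its own; no GET is sent
    document.querySelector('a[href="#reports"]').addEventListener('click', () => {
      document.getElementById('reports').style.display = 'block';
    });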
