Due to an archaic stack, the response from a form submission is HTML. When rendered on my client's domain, the relevant information is injected via a portlet. If you render the markup locally, the necessary content is missing. This makes it impossible for me to simply post data to the relevant form endpoint.
As a result, I need to submit the form in a headless browser and scrape the success/fail page for the necessary information.
I'm planning to wire up an API endpoint in my NodeJS application that I can post the form data to; it will in turn submit the form in the headless browser and respond with the scraped content.
Are there any frameworks that would support this? I've looked at Nightwatch and WebDriver, but they all seem to be aimed at automated testing rather than what I'm after.
Try using CasperJS.
It is a navigation scripting and testing utility for use with SlimerJS (a headless browser).
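If it helps, here is a minimal sketch of what that could look like with CasperJS; the URL, form selector, and field names are placeholders for whatever your client's form actually uses:

```javascript
// Minimal CasperJS sketch: submit a form, then scrape the result page.
// The URL, form selector, and field names below are placeholders.
var casper = require('casper').create();

casper.start('https://example.com/form-page', function () {
    // Fill and submit the form (the trailing true means "submit after filling").
    this.fill('form#target-form', {
        name: 'Jane Doe',
        email: 'jane@example.com'
    }, true);
});

casper.then(function () {
    // The success/fail page has now loaded; scrape what you need.
    var result = this.getHTML('#result', false);
    this.echo(result);
});

casper.run();
```

Since you want to expose this behind a NodeJS endpoint, one common approach is to spawn the CasperJS script as a child process from your route handler and capture its stdout.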
I am trying to scrape a website, but some pages of that website will not be fully returned upon a GET request. I don't want to disclose the URL of the website, but I'd still like to ask for help in this regard.
I've implemented HTTP requests to log into the member area of that website, which works fine. Next, I'd like to get a list of conversations. However, when I compare the responses (the same GET to the same location with the same parameters), I see the full HTML in the Firefox developer tools, but in my implementation (using the request NodeJS module) I only see the inner <div id="content">...</div>, without any JavaScript or surrounding HTML.
How can this be? I understand JavaScript can inject HTML afterwards, but how would that be possible if my scraping implementation never receives any JavaScript? What is different in Firefox? I understand that Firefox is probably running the site's JavaScript client, which performs the GET request and then inserts content. However, the Firefox log shows a plain HTTP GET request (not XHR), and it shows the full response in the dev tools. How is this possible?
Does anyone have a hint on how to proceed further?
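One thing that may be worth ruling out is a difference in request headers: some servers return only a page fragment unless the request looks like a full browser navigation. Here is a sketch of how you might replay the same GET with Firefox-like headers using the request module; the URL and header values are illustrative, so copy the real ones from the dev tools:

```javascript
// Sketch: replay the GET with browser-like headers and a cookie jar.
// Copy the actual header values from the Firefox dev tools request;
// the ones below are only illustrative.
var request = require('request');

var jar = request.jar(); // reuse the same jar for the login requests

request({
    url: 'https://example.com/conversations', // placeholder URL
    jar: jar,
    headers: {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Referer': 'https://example.com/members' // placeholder
    }
}, function (err, res, body) {
    if (err) throw err;
    console.log(res.statusCode, body.length); // compare with the dev tools response
});
```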
So I have made a back end in NodeJS, but I ran into one problem: how is it possible to link my back end to my front-end HTML/CSS page and use my NodeJS functions as scripts?
In case this wasn't clear to you: your NodeJS back end runs on your server. The server's job (in a web app) is to deliver data to the browser. It delivers HTML pages. It delivers resources referenced in those HTML pages, such as scripts, images, fonts, style sheets, etc. It can also answer programmatic requests for data.
The scripts in those web pages run inside the browser, which is nearly always (except for some developer testing scenarios) running on a completely different computer, connected to the server only via some network - usually the internet.
As such, a script in the browser cannot directly reference variables that exist on the server or call functions that exist on the server. They are completely different computers.
The kinds of things you can do to work around this architectural limitation are as follows:
The server can dynamically modify the web page it is sending to the browser and insert data into that page. That data can be in the form of already-rendered HTML, or it can be variables inside script tags that your web page's JavaScript can then use.
The JavaScript in the web page can make network requests to your server asking for data. These are often called AJAX calls. In this scenario, some JavaScript in your page sends a request to the server to retrieve some data or cause some action on the server. The server receives that request, carries out the desired operation, and returns a result to the client JavaScript running in the browser. That client JavaScript receives the result and can then act on it: inserting data into the page, navigating the browser to a new web page, prompting the user, etc.
There are some other ways the web page JavaScript can communicate with the server, such as WebSocket connections, but we'll put those aside for now; they are just more ways for remote communication to happen - the structure of the communication doesn't really change.
how is it possible to link my back end to my front-end HTML/CSS page and use my NodeJS functions as scripts?
You can't directly use your NodeJS functions as scripts in the front end. You can make Ajax calls to the server and ask the server to execute its own server code on your behalf to carry out some operation or retrieve some data.
If appropriate, you can also insert scripts into the web page and run JavaScript directly in the browser, but whether you can do that for your particular situation depends entirely upon what the scripts are doing. If the scripts access some resource that is only available on the server (like a database or server storage), then you won't be able to run those scripts in the browser. You will have to use Ajax calls to ask the server to run them for you and then retrieve the results.
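To make the Ajax option concrete, here is a minimal sketch, assuming an Express server; the route name and response shape are made up for illustration:

```javascript
// server.js - a minimal Express sketch (route and data are illustrative).
const express = require('express');
const app = express();

app.use(express.static('public')); // serves your HTML/CSS/JS files

// A programmatic endpoint the browser's JavaScript can call.
app.get('/api/greeting', (req, res) => {
    res.json({ message: 'Hello from the server' });
});

app.listen(3000);
```

```javascript
// In a <script> on your HTML page: ask the server for data, then update the DOM.
fetch('/api/greeting')
    .then((res) => res.json())
    .then((data) => {
        document.getElementById('output').textContent = data.message;
    });
```

The key point is that the browser code never calls a server function directly; it only sends a request and handles the response.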
I have a fundamental question. I have been searching for the answer for a long time, but I still don't know it exactly.
I am working with browsers and web applications. I am wondering how, and based on what, a web browser decides to send a particular request to the web server.
For example, when you enter http://www.google.com in the address bar of your web browser, the browser will send a bunch of requests to the web server to render the web page properly.
Now, my question is: how does the web browser decide which requests it needs to send to the web server?
Is it related to some tags, like 'link' or 'script', inside the body of the responses?
Does the browser parse the JavaScript functions to see if it should send a request based on those functions?
Let's take an example to explain this one.
Suppose you want to search for something and you hit http://www.google.com in your browser. These are the events that unfold to fetch you the page that will let you type in your query.
First, the networking stack on your machine will try to figure out which actual internet address matches www.google.com. This is called a DNS lookup. Once it receives a response for this lookup in the form of an IP address, it can make a connection to the actual server that is serving google.com.
The machine makes a socket connection and uses the HTTP protocol to communicate with the server. It queries for the resource at / (the root) of the address you are trying to reach. This is called a GET request. The request is normally written like so: GET /
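To make these first two steps concrete, here is a small Node.js sketch that performs the DNS lookup and then speaks raw HTTP over a plain TCP socket (simplified; a real browser would also handle TLS, caching, redirects, and so on):

```javascript
// Sketch: resolve a hostname, open a TCP socket, send a raw GET request.
const dns = require('dns');
const net = require('net');

dns.lookup('www.google.com', (err, address) => {
    if (err) throw err;
    console.log('DNS lookup result:', address);

    const socket = net.connect(80, address, () => {
        // The same "GET /" described above, spelled out as raw HTTP/1.1.
        socket.write('GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: close\r\n\r\n');
    });

    socket.on('data', (chunk) => process.stdout.write(chunk));
    socket.on('end', () => console.log('\n-- response complete --'));
});
```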
Google will respond with an HTML page, normally "index.html", which gets downloaded by the browser.
Once the HTML is downloaded, the linked resources, such as the images needed to render the page as well as the JavaScript it references, get downloaded too.
The downloaded HTML page is parsed and an in-memory tree called the "DOM tree" is created. This tree contains the elements of the HTML page in a hierarchy. Once the DOM is created, you can see the page being rendered in the browser.
During this parsing, the browser discovers more resources to be downloaded, such as images, stylesheets, and JavaScript files. The HTML page references these resources via different tags, such as <img> for images and <script> for JavaScript.
All detected resources are downloaded. Browsers download many of these resources in parallel, but apply them (JavaScript and stylesheets) sequentially in the order they were found on the page.
Stylesheets are parsed, and the styles are applied to the DOM of the HTML page. If stylesheets take long to download, you can sometimes see the "raw" HTML page rendered before the styles are applied; this typically happens over a slow connection.
Once the HTML page and the related JavaScript files have been downloaded, the browser calls the "onload" JavaScript callback. Most JavaScript-heavy applications start up at this time.
Once onload is called, JavaScript takes over and can attach handlers to different elements on the web page. Once the handlers have all been installed, interacting with the web page can call one or more JavaScript functions that are listening for those events.
JavaScript can also manipulate the DOM (the elements on the page), which results in UI updates (what the user sees), and can therefore be used to build a complete app on a single page.
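Here is a tiny page-side example of those last steps (onload, handlers, DOM manipulation); the element IDs are assumed:

```javascript
// Runs in the browser. Once the page has loaded, attach a click handler
// that updates the DOM - no new page load required.
window.onload = function () {
    var button = document.getElementById('search-btn'); // assumed element ID
    var output = document.getElementById('results');    // assumed element ID

    button.addEventListener('click', function () {
        output.textContent = 'You clicked at ' + new Date().toLocaleTimeString();
    });
};
```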
Here is some more reading on the process: http://friendlybit.com/css/rendering-a-web-page-step-by-step/
The best way to examine this interaction is to use the developer tools in Chrome/Firefox or IE and view the network activity when you visit a web page.
If I write a Chrome extension, it normally consists of multiple parts:
One is the devtools page, which is a normal HTML page with its origin set to "chrome-extension://<guid>/filename". On that page I can use the Dropbox API to get user confirmation via an HTML popup and then use the saved auth info and do all the work via the Dropbox JavaScript library.
Another part of the extension is the content script, which is executed in the context of specified third-party web pages ("injected") and has its origin cookies and web storage shared with them.
Is it possible to also use the Dropbox JavaScript library in that content script?
I can't call authenticate in interactive mode, since it will re-ask for confirmation on each different web page I'm injected into. And calling authenticate without interactive mode will fail, since the content script doesn't share the origin, cookies, and web storage with the devtools extension page :(. Maybe there's some way to "pass" the Dropbox auth info from the part of the extension that offers a GUI, where the user successfully confirms Dropbox usage, to the GUI-less parts of the extension, like the content script or background page?
I have managed to get Facebook working from code injected into a web app via a content script. I suspect there are multiple ways, but what I did was take advantage of the chrome.identity API to do the OAuth work for me, specifically launchWebAuthFlow().
This can only be done in the background page (in my case an event page), so I send messages to the event page, which replies with the access_token; the token can then be used in URLs in the same way as with the 'web' technique - i.e. in HTTP requests with XHR.
You can send/receive messages via the content script (using events on document), but I decided to do it directly, using "external" messages: the chrome.runtime.sendMessage() API in the web app context and chrome.runtime.onMessageExternal in the background script. This requires adding "matches" for the URLs you're injecting code into in an "externally_connectable" section of the manifest.json.
I believe this can be adapted to make it work with Dropbox.
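Here is a rough sketch of the pieces described above; the OAuth URL, message shape, and extension ID are placeholders that would need to be filled in from Dropbox's OAuth documentation and your own manifest:

```javascript
// background.js (event page) - runs the OAuth flow and answers requests
// for the token from code running in the web app context.
chrome.runtime.onMessageExternal.addListener(function (msg, sender, sendResponse) {
    if (msg.type === 'get-token') {
        chrome.identity.launchWebAuthFlow({
            url: 'https://www.dropbox.com/oauth2/authorize?...', // fill in client_id, redirect_uri, etc.
            interactive: true
        }, function (redirectUrl) {
            // Parse the access_token out of the redirect URL fragment.
            var token = /access_token=([^&]+)/.exec(redirectUrl)[1];
            sendResponse({ access_token: token });
        });
        return true; // keep sendResponse usable after the async callback
    }
});
```

```javascript
// In the web app context (requires the page's URL to be listed under
// "externally_connectable" in manifest.json):
chrome.runtime.sendMessage('your-extension-id', { type: 'get-token' }, function (reply) {
    // Use reply.access_token in HTTP requests, as with the 'web' technique.
});
```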
I am trying to scrape an Oracle ADF Faces rich client web page, but I am not having the best of luck. I log in automatically using the Node.js request module, but after that I can't get to any other page with request. I get stuck on redirects or the loop script, or I simply don't get the information I expect.
I am using Wireshark to view every page and the way it is handled. I recreate the page to match the headers and even the size, but every time the framework denies me access.
Before you ask: it's legal, and I am not breaking any terms of service. I'm just trying to make a web API to speed up a process. I have used PhantomJS with CasperJS but got stuck on Ajax calls that don't show on the page, and I have tried PHP cURL, but it's much easier with Java.
Any suggestions are really appreciated.
My bad on this one: Wireshark was displaying fields as truncated. If you want to see the full field, you need to right-click the packet and click Follow TCP Stream. Rich clients have very long POSTs generated by the framework behind the rich client, and it appears I was missing about half of the fields when I made the calls.