How to control a web browser from a programming language?

I am looking for a way to control a web browser such as Firefox or Chrome. I need something like Selenium WebDriver, but something that will let me open many instances, load URLs, and get the HTTP headers, response code, response content, load time, etc.
Is there any library, framework, or API that I could use to do this? I couldn't find one that does it all: Selenium opens the browser and goes to the URL, but I can't get the HTTP headers.

Selenium and Jellyfish are strong options in general. Jellyfish is built on Node.js; although I have no experience with it, I've heard good things about it from colleagues.
If you just want to get headers and such, you could use the cURL library or wget. I've used cURL with NuSOAP to query XML web services in PHP, for example. The downside is that these are not functional browsers; they merely perform the HTTP request and consume the response.
http://seleniumhq.org/
https://github.com/admc/jellyfish
http://curl.haxx.se/
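Neither approach does everything in one call, but you can combine them. Below is a minimal sketch, assuming Node.js with the selenium-webdriver package installed and a chromedriver on the PATH (the URL is a placeholder): a plain HTTPS request gets the status code and headers, and WebDriver drives a real browser to measure page load time.

// Sketch: plain request for status code/headers, real browser for load time.
// Assumes Node.js, `npm install selenium-webdriver`, and chromedriver on the PATH.
const { Builder } = require('selenium-webdriver');
const https = require('https');

const url = 'https://example.com/';   // placeholder

// 1) Plain request: status code, headers, body size, rough timing.
const started = Date.now();
https.get(url, (res) => {
  console.log('status:', res.statusCode);
  console.log('headers:', res.headers);
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    console.log('body bytes:', body.length, 'elapsed ms:', Date.now() - started);
  });
});

// 2) Real browser: navigate, then read the Navigation Timing API.
(async () => {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get(url);
    const t = await driver.executeScript('return window.performance.timing');
    console.log('page load ms:', t.loadEventEnd - t.navigationStart);
  } finally {
    await driver.quit();
  }
})();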

Related

HTTP GET request on website does not return full body (website scraping)

I am trying to scrape a website, but some pages of that website are not fully returned upon a GET request. I don't want to disclose the URL of the website, but I'd still like to ask for help in this regard.
I've implemented HTTP requests to log into the member area of that website, which works fine. Next I'd like to get a list of conversations. However, when I compare the responses (the same GET to the same location with the same parameters), the Firefox developer tools show the full HTML, while my implementation (using the request Node.js module) only receives the inner <div id="content">...</div>, without any JavaScript or surrounding HTML.
How can this be? I understand that JavaScript can inject HTML afterwards, but how would that be possible if my scraping implementation never receives any JavaScript? What is different in Firefox? I assume that in Firefox their JavaScript client is running and performing the GET request, which then inserts the content. However, the Firefox log shows a plain HTTP GET request (not an XHR), and it shows the full response in the dev tools. How is this possible?
Does anyone have a hint on how to proceed further?
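One way to narrow this down is to replay the browser's GET with exactly the headers Firefox sends and diff the two responses. Here is a minimal sketch with the request module; the URL and every header value below are placeholders to be copied from the Firefox dev tools, not real values.

// Sketch: replay the browser's GET with the same headers via the `request` module
// to see which header (cookie, user agent, etc.) changes what the server returns.
const request = require('request');

request({
  url: 'https://example.com/conversations',   // placeholder
  headers: {
    'User-Agent': '<exact UA string from Firefox>',
    'Accept': 'text/html,application/xhtml+xml',
    'Cookie': '<session cookie from the logged-in browser>',
    // If the server serves fragments to AJAX clients, toggling this is worth a test:
    // 'X-Requested-With': 'XMLHttpRequest',
  },
  gzip: true,   // let the request module decompress gzip/deflate responses
}, (err, res, body) => {
  if (err) throw err;
  console.log(res.statusCode, body.length);
});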

What's the easiest way to request a list of web pages from a web server one by one?

Given a list of URLs, how does one implement the following automated task (assuming Windows and Ubuntu are the available OSes)? Are there existing tools that can make implementing this easier, or that do it out of the box?
log in with already-known credentials
for each specified url
request page from server
wait for page to be returned (no specific max time limit)
if request times out, try again (try x times)
if server replies, or x attempts failed, request next url
end for each
// Note: this is intentionally *not* asynchronous to be nice to the web-server.
Background: I'm implementing a worker tool that will request pages from a web server so that the data those pages need to crunch through will be cached for later. The worker doesn't care about the resulting pages' contents, although it might care about HTTP status codes. I've considered a phantom/casper/node setup, but I'm not very familiar with that technology and don't want to reinvent the wheel (even though it would be fun).
You can request pages easily with Node's built-in http module.
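For example, a minimal sketch (a plain-HTTP URL is assumed; use the https module for HTTPS):

// Sketch: fetch one page with Node's built-in http module and log the status code.
const http = require('http');

http.get('http://example.com/page', (res) => {
  console.log('status:', res.statusCode);
  res.resume();   // we don't care about the body, just drain it
});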
Some people prefer the request module, available on npm; see its GitHub page for documentation and examples.
If you need more than that (for example, a browser that actually executes JavaScript), you can use PhantomJS; there are bridge modules on GitHub for driving PhantomJS from Node.
Alternatively, you could use simple CLI tools for making requests, such as wget or curl.
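Putting those pieces together, here is a rough sketch of the sequential fetch-with-retry loop from the question, using the request module with a cookie jar for the already-known login. The login URL, form field names, page URLs, and retry count are all placeholders.

// Sketch: log in once, then request each URL strictly one at a time,
// retrying up to MAX_TRIES times on a timeout or network error.
const request = require('request').defaults({ jar: true, timeout: 30000 });

const MAX_TRIES = 3;
const urls = ['http://example.com/a', 'http://example.com/b'];   // placeholders

function fetchWithRetry(url, attempt, done) {
  request.get(url, (err, res) => {
    if (err && attempt < MAX_TRIES) return fetchWithRetry(url, attempt + 1, done);
    if (err) console.log('giving up on', url, err.code);
    else console.log(url, '->', res.statusCode);
    done();
  });
}

function fetchAll(i) {
  if (i >= urls.length) return console.log('done');
  fetchWithRetry(urls[i], 1, () => fetchAll(i + 1));   // intentionally sequential
}

request.post('http://example.com/login',
  { form: { user: 'me', pass: 'secret' } },            // placeholder credentials
  (err) => {
    if (err) throw err;
    fetchAll(0);
  });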

How to catch a Flash stream URL from a browser plugin

My question is similar to this one.
I'm wondering how I can catch, from a browser add-on, the media URL that an SWF loads. Let's say the YouTube Flash player starts playing or loading some video (over HTTP) and I want to know that URL, just like the browser plugins from "RealDownloader" and "Moyea YouTube FLV Downloader" do. I'm a newbie at plugin development and Flash, and I want to know which technologies this might involve: XPCOM, NPAPI, ActiveX, or simple API hooking. Any ideas how this may be accomplished?
NPAPI plugins typically ask the browser to load data for them; they don't do it themselves. This means that a browser extension can intercept these requests, for example by implementing a content policy. Requests initiated by a plugin will trigger a shouldLoad call with type OBJECT_SUBREQUEST.
The simpler option is to use HTTP observers, but that way you won't be able to recognize requests initiated by Flash; they will look just like any other request processed by the browser.
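As a rough sketch of the observer approach (classic Firefox extension code, where the Components globals are available; the contentType filter is only an illustration):

// Sketch: log the URL and content type of every HTTP response the browser handles.
// Only works in (legacy) privileged extension code where Components is defined.
var Cc = Components.classes, Ci = Components.interfaces;

var streamSniffer = {
  observe: function (subject, topic, data) {
    if (topic !== "http-on-examine-response") return;
    var channel = subject.QueryInterface(Ci.nsIHttpChannel);
    // e.g. filter on channel.contentType ("video/x-flv", "video/mp4", ...)
    dump(channel.contentType + "  " + channel.URI.spec + "\n");
  }
};

Cc["@mozilla.org/observer-service;1"]
  .getService(Ci.nsIObserverService)
  .addObserver(streamSniffer, "http-on-examine-response", false);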
Firebug does that, and it's open source. Why not study it a little?
https://github.com/firebug/
It's easy if you only want to get the URL from a single SWF on a single website. For example, if all you need are URLs from that SWF, you can keep just one instance of your browser open and use a tool to intercept its HTTP requests.

Which languages are used to make Google Docs and box.net?

I want to know how, and with which technologies, Google Docs and box.net are made.
Most of the UI functionality comes from using JavaScript and the HTML DOM together with AJAX, a technique for using JS to make additional requests of the server without reloading the page.
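For example, the AJAX part boils down to something like this in the browser (the endpoint and element id are made-up placeholders):

// Sketch: fetch data from the server and update part of the page without a reload.
var xhr = new XMLHttpRequest();
xhr.open('GET', '/api/documents');          // placeholder endpoint
xhr.onload = function () {
  document.getElementById('doc-list').textContent = xhr.responseText;
};
xhr.send();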
In terms of the back-end languages (that provide the dynamic content), box.net returns PHPSESSID as part of its Set-Cookie HTTP response header. They're also running nginx. So I would suspect one of the many PHP frameworks is in use.
As for Google Docs, Google is known to use Python quite extensively. Google App Engine uses Python or Java as its languages (I believe Python was supported first). So I suspect they use some customised form of Python, based on their own instance of their own App Engine. Their HTTP headers give nothing away, except for the Server: GSE line.
According to HowStuffWorks, Google Docs uses Java for the backend and JavaScript for the front end. Of course, HTML is in the mix there as well.
As for the database it uses, Google won't say. It will use the cloud, though; we can be sure of that.

HTTP request builders for GNU/Linux?

I’m looking for tools for interactively inspecting HTTP servers by manually constructing requests (and viewing responses), under GNU/Linux. Something that would let me quickly specify standard header fields, make a form request body, etc. (netcat doesn’t really excel at this.)
Any suggestions?
Maybe cURL is simply what you need?
A Python script with urllib2 would seem appropriate. You can manipulate headers at will, and of course you have access to all the request/response fields too. There are plenty of urllib2 tutorials around.
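The same kind of scripting works in Node.js if you prefer it to Python. A minimal sketch with the built-in http module, constructing the headers and a form-encoded body by hand (the host, port, path, custom header, and form fields are all placeholders):

// Sketch: hand-built POST with custom headers and a form-encoded body,
// using only Node's built-in modules.
const http = require('http');
const querystring = require('querystring');

const body = querystring.stringify({ name: 'test', comment: 'hello' });   // placeholder fields

const req = http.request({
  host: 'localhost',                      // placeholder host
  port: 8080,
  path: '/submit',                        // placeholder path
  method: 'POST',
  headers: {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Content-Length': Buffer.byteLength(body),
    'X-Debug': '1',                       // any custom header you want to test
  },
}, (res) => {
  console.log(res.statusCode, res.headers);
  res.pipe(process.stdout);               // dump the response body
});

req.write(body);
req.end();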
If all you are looking to do is to make manual HTTP requests to interact with web services for testing / reference purposes, then there is a really nice add-on for Firefox that does just that. It is called Poster.
