Using Google AppEngine Urlfetch instead of urllib2 - security

What is the difference between Google's urlfetch and the Python library urllib2?
When I came upon Google's urlfetch, I thought there might be security reasons for it. Perhaps Google's version is safer against malicious URLs or something?
Is there any reason why I should choose Google's urlfetch over urllib2?

Note that on GAE, urllib, urllib2 and httplib are just wrappers around URL Fetch (see "Fetching URLs in Python").
One difference is that the urlfetch module provides you with an interface for making asynchronous requests.
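For illustration, here is a minimal sketch of that asynchronous interface. It assumes the legacy Python runtime, where `from google.appengine.api import urlfetch` is available. `fetch_status_codes` is a name I made up, and the module is passed in as an argument only so the sketch can be read and tried outside App Engine; `create_rpc`, `make_fetch_call` and `get_result` are the calls described in the URL Fetch docs.

```python
def fetch_status_codes(urlfetch, urls, deadline=10):
    """Start every fetch first, then collect the results.

    `urlfetch` is expected to be the google.appengine.api.urlfetch module
    (passing it in is an assumption of this sketch, not a GAE convention).
    """
    pending = []
    for url in urls:
        rpc = urlfetch.create_rpc(deadline=deadline)  # one RPC object per request
        urlfetch.make_fetch_call(rpc, url)            # returns without waiting
        pending.append((url, rpc))

    statuses = {}
    for url, rpc in pending:
        result = rpc.get_result()                     # blocks until this fetch is done
        statuses[url] = result.status_code
    return statuses
```

On App Engine you would call it as `fetch_status_codes(urlfetch, [...])` after importing the module. A plain `urllib2.urlopen` call, by contrast, blocks until its response arrives.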

I don't work for Google, so this is just a guess from various GAE posts I've read. App Engine instances don't face the internet directly, but are buried behind layers of Google infrastructure. When a browser makes an HTTP request, it doesn't go straight to your instance, but rather it hits a Google edge server which eventually routes the request to a GAE instance.
Likewise, when making an outbound HTTP request, your instance doesn't just open a socket (which urllib2 would normally do); instead it sends the request to some other Google edge server, which then makes the HTTP request on its behalf. Using urllib2 on GAE gives you a GAE-specific version that runs on top of urlfetch.

There is no problem with using the standard libraries on App Engine. The URL Fetch API is just a service for making HTTP requests more "easily" than with urllib2. It is easier to understand for a Python novice, and it makes it easy to issue a non-blocking request, for example.
I suggest you read the additional information here: https://developers.google.com/appengine/docs/python/urlfetch/overview
If Google found a security problem in one of the Python standard libraries, I guess it would ship a fix ;)

The difference is that urlfetch only has a functional interface, while urllib and httplib have an OO interface. An OO interface can be very useful. I have seen a good example in the oauth2 client lib, where the request instance is passed to the client lib to check whether the token is valid and authorized.

Related

Is it possible to identify which client sent a HTTP request?

Is it possible to identify the client / library which sent a HTTP request?
I am trying to fetch some data via an API and it is possible to query the API via cURL and python, but when I try to use node (doesn't matter which library, axios requests, unirest, native, ...) or wget I get a proprietary error back from the backend.
Now I am wondering whether the backend is able to identify which library I am using.
More information:
The requests are exactly the same, so no way to distinguish them
The user-agent header field is set and overwritten for all requests
I already tried to monitor the traffic in Wireshark, but couldn't find any differences between the packets at the HTTP layer (only the order of the header fields is different, but according to the standard this shouldn't make a difference)
It turns out that the problem was TLS fingerprinting.
See: https://httptoolkit.tech/blog/tls-fingerprinting-node-js/
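To make the idea concrete: a TLS fingerprint is derived from what the client offers in its ClientHello (cipher suites, extensions and so on), and each TLS stack offers a slightly different set in a different order. The sketch below uses Python's `ssl` module to list the cipher suites enabled on its default context and hashes them. `offered_cipher_names` and `toy_fingerprint` are names I made up, and the hash is a deliberate simplification, not the real JA3 algorithm (which covers more ClientHello fields).

```python
import hashlib
import ssl


def offered_cipher_names(context=None):
    """Return the cipher suite names enabled on an SSLContext, as reported."""
    ctx = context if context is not None else ssl.create_default_context()
    return [cipher["name"] for cipher in ctx.get_ciphers()]


def toy_fingerprint(cipher_names):
    """Hash the ordered cipher list; a different order gives a different hash."""
    joined = ",".join(cipher_names)
    return hashlib.sha256(joined.encode("ascii")).hexdigest()


if __name__ == "__main__":
    names = offered_cipher_names()
    print(len(names), "cipher suites enabled")
    print("toy fingerprint:", toy_fingerprint(names))
```

A server that sees identical headers but a different ClientHello can therefore still tell curl, Python and Node apart, which matches what the linked article describes.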
Node.js uses Google's V8 JavaScript engine, and V8-based HTTP clients will not allow you to override headers that would compromise 'web safety'. For example, if you set the Origin, Host or Referer header, Node might refuse to do so. I had the same issue previously.
Unopinionated HTTP clients, such as curl (written in C) and the Python ones, won't run a 'web safety' check on your requests, so that is what causes the difference in behavior.
In my case I called a C++ library from JavaScript to make my 'unsafe' requests, and the problem was solved.

How HTTP response is generated

I'm fairly new to programming and this question is about making sure I get the HTTP protocol correctly. My issue is that when I read about HTTP request/response, it looks like it needs to be in a very specific format with a status code, HTTP version number, headers, a blank line followed by the body.
However, after creating a web app with nodejs/express, I never once had to actually write code that made an HTTP response in this format (I'm assuming, although I don't know for sure that other frameworks like ruby on rails or python/Django are the same). In the express app, I just set up the route handlers to render the appropriate pages, when a request was made to that route.
Is this because express is actually putting the response in the correct HTTP format behind the scenes? In other words, if I looked at the expressJS code, would there be something in that code that actually makes an HTTP response in the HTTP format?
My confusion is that the HTTP request/response format seems so important, yet somehow I never had to write any code dealing with it for a node/express application. Maybe this is the entire point of a framework like express: to take out the details so that developers can deal with business logic. And if that is correct, does anyone ever write web apps without a framework to do this? Would you then be responsible for writing code that puts the server's response into the exact HTTP format?
"I'm fairly new to programming and this question is about making sure I get the HTTP protocol correctly. My issue is that when I read about HTTP request/response, it looks like it needs to be in a very specific format with a status code, HTTP version number, headers, a blank line followed by the body."
Just to give you an idea, there are probably hundreds of specifications that have something to do with the HTTP protocol. They deal not only with the protocol itself, but also with the data format/encoding for everything you send, including headers and all the various content types you can send, authentication schemes, caching, status codes, URL decoding, etc. You can see some of the specifications involved just by looking here: https://www.w3.org/Protocols/.
Now a simple request and a simple text response could get away with only knowing a few of these specifications, but life is not always that simple.
"Is this because express is actually putting the response in the correct HTTP format behind the scenes? In other words, if I looked at the expressJS code, would there be something in that code that actually makes an HTTP response in the HTTP format?"
Yes, there would. A combination of Express and the HTTP library that is built into node.js handles all the details of the specification for you. That's the advantage of using a library/framework. They even handle different versions of the protocol, and feedback from thousands of other developers has helped them clean up edge-case bugs. A good library/framework still lets you control any detail of the response (headers, content types, status codes, etc.) without making you go through the detail work of actually creating the exact response. This is a good thing. It lets you write code faster and lets you ride on the shoulders of others who have already figured out minutiae that have nothing to do with the logic of your app.
In fact, one could say the same about the TCP protocol below the HTTP protocol. No regular app developer wants to write their own TCP stack. Instead, you just want a working TCP stack that you can use that's already been tuned and debugged for you.
"However, after creating a web app with nodejs/express, I never once had to actually write code that made an HTTP response in this format (I'm assuming, although I don't know for sure that other frameworks like ruby on rails or python/Django are the same). In the express app, I just set up the route handlers to render the appropriate pages, when a request was made to that route."
Yes, this is a good thing. The framework did the detail work for you. You just call res.setHeader(), res.status(), res.cookie(), res.send(), res.json(), etc., and Express makes the entire response for you.
"And if that is correct, does anyone ever write web apps without a framework to do this? Would you then be responsible for writing code that puts the server's response into the exact HTTP format?"
If you didn't use a framework or library of any kind and were programming at the raw TCP level, then yes you would be responsible for all the details of the HTTP protocol. But, hardly anybody other than library developers ever does this because frankly it's just a waste of time. Every single platform has at least one open source library that does this already and even if you were working on a brand new platform, you could go get an open source body of code and port it to your platform much quicker than you could write all this yourself.
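To show what "all the details" means at the smallest scale, here is a rough sketch of building a response by hand. It is written in Python just to keep it short; `build_http_response` is a made-up helper, and it covers only the status line, three headers, the blank line and the body.

```python
def build_http_response(status_code, reason, body,
                        content_type="text/plain; charset=utf-8"):
    """Assemble a minimal HTTP/1.1 response as raw bytes."""
    body_bytes = body.encode("utf-8")
    head_lines = [
        "HTTP/1.1 {} {}".format(status_code, reason),   # status line
        "Content-Type: {}".format(content_type),
        "Content-Length: {}".format(len(body_bytes)),   # length in bytes, not characters
        "Connection: close",
    ]
    # Header lines end with CRLF; an empty line separates headers from the body.
    head = "\r\n".join(head_lines) + "\r\n\r\n"
    return head.encode("ascii") + body_bytes


if __name__ == "__main__":
    print(build_http_response(200, "OK", "hello"))
```

A raw TCP server would write these bytes to the socket with `sendall`. Everything beyond this (chunked encoding, keep-alive, header validation and so on) is what Express and node's http module take care of for you.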
Keep in mind that one of the HUGE advantages of node.js is that there's an enormous body of open source code (mostly on npm and GitHub) already prepackaged to work with node.js. And because node.js runs server-side, where code memory isn't usually tight and where code just comes from the local hard disk at server init time, there's little downside to grabbing a working and tested package that does what you need, even if you're only going to use 5% of the functionality in the package. Or, worst case, clone an existing repository and modify it to suit your needs perfectly.
"Is this because express is actually putting the response in the correct HTTP format behind the scenes?"
Yes, exactly. HTTP is so ubiquitous that almost all programming languages and frameworks handle the actual writing and parsing of HTTP behind the scenes.
"Does anyone ever write web apps without a framework to do this? Would you then be responsible for writing code that puts the server's response into the exact HTTP format?"
Never (unless you're writing code that needs very low level tweaking of HTTP code or something)

What is the difference between http module and express module?

I'm learning Node.js from: http://www.tutorialspoint.com/nodejs/
And I can't understand the difference between using the http module (get/post methods) and using the express module (get/post methods).
It seems that the express module makes development faster.
Are there advantages to using the http module compared to the express module?
Are there advantages to using the express module compared to the http module?
Thanks
Express is not a "module", it's a framework: it gives you an API, submodules, and methodology and conventions for quickly and easily tying together all the components necessary to put up a modern, functional web server with all the conveniences necessary for that (static asset hosting, templating, handling CSRF, CORS, cookie parsing, POST data handling, you name it, it probably lets you use it).
The http API that's baked into Node.js, on the other hand, is just the http module: it can set up connections and send and receive data, as long as the connections use the hypertext transfer protocol (with the relevant HTTP verb) and that's... well, that's it. That's all it does.
They are completely different things, as many of the articles you can find by searching the web for details on both will tell you.

How to control web browser using some programming language?

I am looking for a way to control a web browser such as Firefox or Chrome. I need something like "selenium webdriver", but one that will allow me to open many instances, load URLs, and get HTTP headers, response codes, response content, load times, etc.
Is there any library, framework or API that I could use to do this? I couldn't find one that does it all: Selenium opens a browser and goes to a URL, but I can't get the HTTP headers.
Selenium and Jellyfish are strong options in general. Jellyfish is an option that uses Node.js - although I have no experience with it, I've heard good things from my colleagues.
If you just want to get headers and such, you could use the cURL library or wget. I've used cURL with NuSOAP to query XML web services in PHP, for example. The downside is that these are not functional browsers, and merely perform the HTTP requests and consume the response.
http://seleniumhq.org/
https://github.com/admc/jellyfish
http://curl.haxx.se/
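If headers, status code, content and load time are all you need, the non-browser route mentioned above can be sketched with Python's standard library instead of cURL. This is only an illustration: `probe` is a made-up name, and like cURL it performs the HTTP request without rendering the page or running any JavaScript.

```python
import time
import urllib.request


def probe(url, timeout=10):
    """Fetch a URL without a browser; return status, headers, body and load time."""
    started = time.perf_counter()
    # Note: urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()
        status = response.getcode()
        # Repeated headers (e.g. Set-Cookie) collapse to one entry in a dict.
        headers = dict(response.headers.items())
    elapsed = time.perf_counter() - started
    return {"status": status, "headers": headers,
            "body": body, "seconds": elapsed}
```

Calling `probe("http://example.com/")` (a placeholder URL) returns a dict you can inspect. The measured time covers only the single request, not the images, scripts and stylesheets a real browser would also load.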

Which language is used to make google docs and box.net?

I want to know how Google Docs and box.net are made, and which technologies are used.
Most of the UI functionality comes from using Javascript and HTML's DOM together with AJAX, a technique for using JS to make additional requests of the server without reloading the page.
In terms of the back-end languages (that provide the dynamic content), box.net returns PHPSESSID as part of its Set-Cookie HTTP response header. They're also running nginx. So I would suspect that one of the many PHP frameworks is in use.
As for Google Docs, Google is known to use Python quite extensively. Google's "App Engine" uses Python or Java as its languages (I believe Python was added first). So I suspect they use some customised form of Python based on their own instance of their own App Engine. Their HTTP headers give nothing away, except for the Server: GSE line.
According to HowStuffWorks, Google Docs uses Java for the backend and JavaScript for the front end. Of course, HTML is in the mix there as well.
As for the database it uses, Google won't say. It will use the cloud though, we can be sure of that.

Resources