When I connect to my site with Mathematica (Import["mysite","Data"]) and look at my Apache log, I see:
99.XXX.XXX.XXX - - [22/May/2011:19:36:28 +0200] "GET / HTTP/1.1" 200 6268 "-" "Mathematica/8.0.1.0.0 PM/1.3.1"
Could I set it to be something like this (what I see when I connect with a real browser)?
99.XXX.XXX.XXX - - [22/May/2011:19:46:17 +0200] "GET /favicon.ico HTTP/1.1" 404 183 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24"
As far as I know, you can't change the user agent string in Mathematica itself. I once used a proxy server (CNTLM) to get Mathematica to talk through a firewall that used NTLM authentication (which Mathematica doesn't support). CNTLM also allows you to set the user agent string.
You can find it at http://cntlm.sourceforge.net/. Basically, you set up this proxy server to run on your own machine and enter its port number and IP address in the Mathematica network settings. The proxy adds the user agent header and handles the NTLM authentication. I'm not sure how it works if you don't have an NTLM firewall, but there are other free proxies around that might work for you.
EDIT: The Squid HTTP proxy seems to do what you want. It has the request_header_replace configuration directive, which allows you to change the contents of request headers.
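For example, a minimal Squid setup might look like this (a sketch using Squid 3.x directive names; exact syntax varies by version):

# Drop the User-Agent header Mathematica sends...
request_header_access User-Agent deny all
# ...and substitute a browser-like one.
request_header_replace User-Agent Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24

Then point Mathematica's proxy settings at the Squid instance, just as with CNTLM above.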
Here is a way to use the Apache HTTP client through JLink:
Needs["JLink`"]
ClearAll#urlString
urlString[userAgent_String, url_String] :=
JavaBlock#Module[{http, get}
, http = JavaNew["org.apache.commons.httpclient.HttpClient"]
; http#getParams[]#setParameter["http.useragent", MakeJavaObject#userAgent]
; get = JavaNew["org.apache.commons.httpclient.methods.GetMethod", url]
; http#executeMethod[get]
; get#getResponseBodyAsString[]
]
You can use this function as follows:
$userAgent =
  "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24";
urlString[$userAgent, "http://www.htttools.com:8080/"]
You can feed the result to ImportString if desired:
ImportString[urlString[$userAgent, "mysite"], "Data"]
A streaming approach would be possible using more elaborate code, but the string-based approach taken above is probably good enough unless the target web resource is very large.
I tried this code in Mathematica 7 and 8, and I expect that it works in v6 as well. Beware that there is no guarantee that Mathematica will always include the Apache HTTP client in future releases.
How It Works
Despite being expressed in Mathematica, the solution is essentially implemented in Java. Mathematica ships with a Java runtime environment built-in and the bridge between Mathematica and Java is a component called JLink.
As is typical of such cross-technology solutions, there is a fair amount of complexity even when there is not much code. It is beyond the scope of this answer to discuss how the code works in detail, but a few items will be emphasized as suggestions for further reading.
The code uses the Apache HTTP client. This Java library was chosen because it ships as an unadvertised part of the standard Mathematica distribution -- and it also happens to be the one that Import appears to use internally.
The whole body of urlString is wrapped in JavaBlock. This ensures that any Java objects created over the course of the operation are properly released, by coordinating the activities of the Java and Mathematica memory managers.
JavaNew is used to create the relevant Apache HTTP client objects, HttpClient and GetMethod. Java expressions like http.getParams() are expressed in JLink as http@getParams[]. The Java classes and methods are documented in the Apache HTTP client documentation.
The use of MakeJavaObject is somewhat unusual. It is required in this case as a Mathematica string is being passed as an argument where a Java Object is expected. If a Java String was expected, JLink would automatically create one. But JLink is unable to make this inference when Object is expected, so MakeJavaObject is used to give JLink a hint.
What about URLTools?
Incidentally, the first thing I tried when answering this question was Utilities`URLTools`FetchURL. It looked very promising since it takes an option called "RequestHeaderFields". Alas, this did not work because the present implementation of that function uses that option only for HTTP POST requests, not GET. Perhaps some future version of Mathematica will support the option for GET.
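For the record, the attempted call looked roughly like this (a sketch; the exact format of the option value is my guess, and as noted the header is ignored for GET in the current implementation):

Needs["Utilities`URLTools`"]

(* Evaluates, but the custom header only takes effect for POST requests,
   so the default Mathematica user agent is still sent for a GET. *)
FetchURL["http://www.htttools.com:8080/",
  "RequestHeaderFields" -> {"User-Agent: " <> $userAgent}]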
I'm extremely lazy, and curl is more flexible in less code than J/Link, without the object management issues. This is an example of posting data (userPass) to a URL and retrieving the result in JSON format.
Import["!curl -A Mozilla/4.0 --data " <> userPass <> " " <> url, "JSON"]
I isolate this kind of thing in an impure function (unless it is pure) so I know it's tainted, but any web access is that way.
Because I use a pipe, Mathematica cannot deduce the type of file. ref/Import mentions that « Import["!prog","format"] imports data from a pipe. » and « The format of a file is by default deduced from the file extension in its name, or by FileFormat from its contents. » As a result, it is necessary to specify "CSV", "JSON", etc. as the format parameter; you'll see some strange results otherwise.
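For the original user-agent question, the same pattern works with a plain GET. A sketch (the site URL is the placeholder from the question; note the quotes around the agent string, which contains spaces):

$userAgent = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24";

(* -s silences curl's progress meter; -A sets the User-Agent header. *)
Import["!curl -s -A '" <> $userAgent <> "' http://mysite", "Data"]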
curl is a command line tool for transferring data with URL syntax, supporting DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, TELNET and TFTP. curl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, kerberos...), file transfer resume, proxy tunneling and a busload of other useful tricks.
From the curl and libcurl welcome page.
Mathematica can do all of its internet connectivity through a user-specified proxy server. If, as Sjoerd suggested, setting one up is too much work, you might want to consider writing the call in C/C++ and then calling that from Mathematica. I don't doubt there are plenty of C libraries that do what you want in a few lines of code.
For calling C code from Mathematica, see the C Language Interface documentation.
Mathematica 9 has the new URLFetch function. It has the option UserAgent.
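A minimal sketch (the URL is the placeholder from the question):

URLFetch["http://mysite",
  "UserAgent" -> "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24"]

By default URLFetch returns the response body as a string, which can then be fed to ImportString as shown earlier.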
You can also use J/Link to make your web requests or call curl or wget on the command line.
Related
Do you know how to change the response headers in CouchDB? Right now it sends Cache-Control: must-revalidate, and I want to change it to no-cache.
I do not see any way to configure CouchDB's cache header behavior in its configuration documentation for general (built-in) API calls. Since this is not a typical need, lack of configuration for this does not surprise me.
Likewise, last I tried, even show and list functions (which do give custom developer-provided functions some control over headers) do not really leave the cache headers under developer control either.
However, if you are hosting your CouchDB instance behind a reverse proxy like nginx, you could probably override the headers at that level. Another option would be to add the usual "cache busting" hack of adding a random query parameter in the code accessing your server. This is sometimes necessary in the case of broken client cache implementations but is not typical.
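For instance, a hypothetical nginx fragment along these lines should work (proxy_hide_header and add_header are real directives; the rest is an assumption about your deployment):

# Reverse-proxy CouchDB (default port 5984) and override its cache header.
location / {
    proxy_pass http://127.0.0.1:5984;
    proxy_hide_header Cache-Control;      # drop CouchDB's must-revalidate
    add_header Cache-Control "no-cache";  # serve no-cache instead
}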
But taking a step back: why do you want to make responses no-cache instead of must-revalidate? I could see perhaps occasionally wanting to override in the other direction, letting clients cache documents for a little while without having to revalidate. Not letting clients cache at all seems a little curious to me, since the built-in CouchDB behavior using revalidated Etags should not yield any incorrect data unless the client is broken.
I'm trying to develop a native OS X app that uses the Nest API. Unfortunately, their client registration only accepts "https://" URIs for the redirect-URL. Since there's no server involved in this (other than Nest's server), I need to redirect to my app. To do that, I need to be able to redirect to an arbitrary URI.
I tried to send this feedback to Nest directly, but they don't seem to have a support contact or bug reporting available.
Am I missing some other authentication approach for this type of use? It's a similar problem on iOS.
Thanks!
In the normal browser world, Nest can only assure that HTTPS is secure. Yes, there are other application protocols that are secure, but the standards are not well defined. As such, the return URIs are limited to HTTPS and http://localhost (it is assumed that if someone has control of your machine, they can also intercept HTTPS calls).
Mac OS and iOS have a relatively simple workaround for this that is demonstrated in Nest's iOS NestDK sample code. The key parts are:
In line 30 of constants.m you will see that RedirectURL is defined (when running this sample code, you might want to change this to your preferred URL, likely something your company already controls for further security)
And line 126 of NestWebViewAuthController.m is where the app checks whether the WebView is trying to load the dummy redirect URI. If so, it captures the parameters and tries to get a token that can be used with the Nest API.
I'd like to try the pack200 compression for a Java applet. I understand that the browser must support this for it to work, and according to documentation it does if it sends "Accept-encoding: pack200-gzip" to the server. However, my browsers (tried a couple) won't send that, only "Accept-encoding: gzip, deflate". Since I assumed the JRE is the key for the browser to use this new encoding, I've tried installing several Java REs from 1.6.0.34 to latest 1.7, but with no success. What am I missing here? Is there something I've misunderstood?
Unfortunately, googling this does not give much help; I've tried!
Edit: OK, I found out what I misunderstood. I was using an HTTP analyzer to see what the browser was sending to the server, but of course it's not the browser sending this particular request, it's the JVM. Looking at the requests on the server, I see the correct Accept-Encoding being sent.
You can add Java Web Start (JNLP) support for your applet.
Both JNLP applications and JNLP applets can be wrapped with pack200 and unwrapped on the client machine; see the jnlp desc documentation for more details, and the sketch below.
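A sketch of a JNLP descriptor with pack200 enabled (file and class names are placeholders; jnlp.packEnabled is the documented property that makes the JNLP client request .jar.pack.gz versions of the jars):

<jnlp spec="1.0+" codebase="http://example.com/app/" href="applet.jnlp">
  <information>
    <title>My Applet</title>
    <vendor>Example</vendor>
  </information>
  <resources>
    <!-- Ask the JNLP client to fetch applet.jar.pack.gz when available. -->
    <property name="jnlp.packEnabled" value="true"/>
    <j2se version="1.6+"/>
    <jar href="applet.jar" main="true"/>
  </resources>
  <applet-desc name="MyApplet" main-class="com.example.MyApplet" width="640" height="480"/>
</jnlp>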
A couple of days back, I came across news about how hackers stole 200,000+ Citi accounts just by changing numbers in the URL. It seems the developers compromised security in the name of being RESTful, and also didn't bother to use a session id in place of the user id. I'm also working on a product where security is the main concern, so I'm wondering: should we avoid REST and use POST requests everywhere in such a case? Or am I missing something important about REST?
Don't blame the model for a poor implementation, instead learn from the mistakes of others.
That's my (brief) opinion, but I'm sure better answers will be added :)
(P.S. - using POST doesn't increase the security in any way)
The kinds of security issues mentioned in the question have largely nothing to do with REST, SOAP or the Web. They have to do with how one designs the application.
Here is another common example. Say, there is a screen in an e-commerce application to show the details of an order. For a logged in user, the URL may look like this:
http://example.com/site/orders?orderId=1234
Assuming that orders are globally unique (across all users of that system), one could easily replace that orderId with some other valid orderId not belonging to the user and see the details. The simple way to protect against this is to make sure that the underlying query (SQL etc.) also includes the user's id as a conjunction (an AND in the WHERE clause for SQL).
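In SQL terms, the fix is one extra conjunct; a sketch with made-up table and column names:

-- Vulnerable: any authenticated user can read any order.
SELECT * FROM orders WHERE order_id = :orderId;

-- Safer: scope the lookup to the authenticated user's id from the session.
SELECT * FROM orders WHERE order_id = :orderId AND user_id = :sessionUserId;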
In this specific case, a good application design would have ensured that the account id coming in the URL is verified with the associated authenticated session.
The same data gets transmitted across the wire whether you use GET or POST. Here is a sample GET request that is the result of submitting a form [took out User-Agent value because it was long]:
GET /Testing/?foo=bar&submit=submit HTTP/1.1
Host: localhost
User-Agent: ...
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://localhost/Testing/demoform.html
Now here's the same request as a POST:
POST /Testing/ HTTP/1.1
Host: localhost
User-Agent: ...
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://localhost/Testing/demoform.html
Content-Type: application/x-www-form-urlencoded
Content-Length: 21
foo=bar&submit=submit
Note that this is what the server sees when you submit the request or what a man-in-the-middle attacker might see while intercepting the request.
In the GET we see that foo = bar and submit = submit on the first line of the request. In the POST we have to look at the last line to see that...hey! foo = bar and submit = submit. Same thing.
At the browser level, the difference manifests itself in the address bar. The first request shows the ?foo=bar&submit=submit string and the second one does not. A malicious person that wants to intercept this data doesn't care about what appears in the browser's address bar. The main danger comes about because anyone can copy a URL out of the address bar and pass it around thus leaking the parameters; in fact it is very common for people to do that.
The only way to keep our malicious person from seeing either of these types of requests is to encrypt everything before it is sent to the server. The server provides the public key, which is used via the HTTPS protocol (SSL/TLS). The browser uses the key to encrypt the request, and the server decrypts it with its private key. There is still an issue on the client side as to whether the key it received from the server actually belongs to the people running the server. This has to be verified via some out-of-band trust system like third-party verification or fingerprint comparisons.
All this is completely orthogonal to REST. Regardless of how you do it, if you are communicating information across the wire with HTTP you are going to have this issue and you're going to need to encrypt the requests/responses to prevent malicious people from seeing them.
POST requests are no safer than RESTful requests, which are no safer than GET requests.
There is a range of measures to increase the security of your application; they cannot all be listed here. Wikipedia covers a good number of attacks and methods to prevent them.
Here's an example: GET requests shouldn't be used for critical actions such as withdrawing money from a bank account, because if you're logged into your bank account, someone can plant a rogue image with the source http://yourbank.com/actions/withdraw/300USD , and when the URL is loaded, money is withdrawn from your account. This is easily countered by using a POST request.
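Concretely, the attack is just an image tag on any page the victim visits, reusing the URL above:

<!-- The browser fires this GET with the victim's cookies attached;
     no click is required, and a 1x1 broken image is easy to miss. -->
<img src="http://yourbank.com/actions/withdraw/300USD" width="1" height="1">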
And then we have some further security considerations to take into account when dealing with this POST request, because again, it can be spoofed.
Using POST instead of GET as a security measure is simply "security through obscurity". In reality it is no safer, as anyone with a small amount of knowledge of HTTP can forge a POST request.
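For instance, forging a POST takes a single curl invocation (the parameter names here are made up for illustration):

# Any attacker can hand-craft the same POST a browser form would send.
curl -X POST -d "userId=1234&amount=300" http://yourbank.com/actions/withdraw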
Using session ids instead of user ids is also just another way of obscuring the security hole; it's not really fixing the problem, as session ids can be hijacked too.
The fact that in this particular case the security hole was extremely easy to exploit by changing the URL does not make the use of REST the cause of the security issue.
As others have mentioned, whenever you need to secure REST services, HTTPS is the place to start looking.
I would consider REST and all web applications security concerns very similar.
The problem stated in the question is considered a "schoolbook" error, something an experienced web developer would not make. So if you understand web app security, you'll understand REST security as well.
Add an experienced web developer to your team if you don't have one; they'll help you with REST.
If security is the main concern, exclusively use https:// and POST, never http:// and GET.
This will avoid the attack you describe as well as many others such as session hijacking, and simple eavesdropping on the line.
(... and abstain from making the mistake of authenticating with https:// and switching to http:// later, which was the "de facto standard" until a few months ago, when someone published a tool that did the obvious)
Like a lot of developers, I want to make JavaScript served up by Server "A" talk to a web service on Server "B" but am stymied by the current incarnation of same origin policy. The most secure means of overcoming this (that I can find) is a server script that sits on Server "A" and acts as a proxy between it and "B". But if I want to deploy this JavaScript in a variety of customer environments (RoR, PHP, Python, .NET, etc. etc.) and can't write proxy scripts for all of them, what do I do?
Use JSONP, some people say. Well, Doug Crockford pointed out on his website and in interviews that the script tag hack (used by JSONP) is an unsafe way to get around the same origin policy. There's no way for the script being served by "A" to verify that "B" is who they say they are and that the data it returns isn't malicious or will capture sensitive user data on that page (e.g. credit card numbers) and transmit it to dastardly people. That seems like a reasonable concern, but what if I just use the script tag hack by itself and communicate strictly in JSON? Is that safe? If not, why not? Would it be any more safe with HTTPS? Example scenarios would be appreciated.
Addendum: Support for IE6 is required. Third-party browser extensions are not an option. Let's stick with addressing the merits and risks of the script tag hack, please.
Currently, browser vendors are split on how cross-domain JavaScript should work. A secure and easy-to-use option is Flash's crossdomain.xml file. Most languages have cross-domain proxies written for them, and they are open source.
A more nefarious solution would be to use XSS the way the Samy worm spread. XSS can be used to "read" a remote domain using XMLHttpRequest. XSS isn't even required if the other domain has added a <script src="https://YOUR_DOMAIN"></script>. A script tag like this allows you to evaluate your own JavaScript in the context of another domain, which is identical to XSS.
It is also important to note that even with the restrictions of the same origin policy, you can get the browser to transmit requests to any domain; you just can't read the response. This is the basis of CSRF. You could write invisible image tags to the page dynamically to get the browser to fire off an unlimited number of GET requests. This use of image tags is how an attacker obtains document.cookie using XSS on another domain. CSRF POST exploits work by building a form and then calling .submit() on the form object.
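A sketch of such a CSRF POST exploit (the target URL and field names are made up):

<!-- A hidden form auto-submitted by script; the browser attaches the
     victim's cookies to the POST, just like the image-tag GET trick. -->
<form id="csrf" action="http://victim.example/transfer" method="POST">
  <input type="hidden" name="amount" value="300">
</form>
<script>document.getElementById("csrf").submit();</script>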
To understand the Same Origin Policy, CSRF and XSS better, you should read the Google Browser Security Handbook.
Take a look at easyXDM; it's a clean JavaScript library that allows you to communicate across the domain boundary without any server-side interaction. It even supports RPC out of the box.
It supports all 'modern' browsers, as well as IE6, with transit times < 15ms.
A common use case is to use it to expose an ajax endpoint, allowing you to do cross-domain ajax with little effort (check out the small sample on the front page).
What if I just use the script tag hack by itself and communicate strictly in JSON? Is that safe? If not, why not?
Let's say you have two servers: frontend.com and backend.com. frontend.com includes a <script> tag like this: <script src="http://backend.com/code.js"></script>.
When the browser evaluates it, code.js is considered a part of frontend.com and NOT a part of backend.com. So, if code.js contained XHR code to communicate with backend.com, it would fail.
Would it be any more safe with HTTPS? Example scenarios would be appreciated.
If you just converted your <script src="https://backend.com/code.js"> to https, it would NOT be any more secure. If the rest of your page is http, an attacker could easily man-in-the-middle the page and change that https to http, or worse, include his own JavaScript file.
If you convert the entire page and all its components to https, it would be more secure. But if you are paranoid enough to do that, you should also be paranoid enough NOT to depend on an external server for your data. If an attacker compromises backend.com, he has effectively got enough leverage over frontend.com, frontend2.com and all of your websites.
In short, https is helpful, but it won't help you one bit if your backend server gets compromised.
So, what are my options?
Add a proxy server on each of your client applications. You don't need to write any code; your web server can do this for you automatically. If you are using Apache, look up mod_rewrite (see the sketch after this list).
If your users are using the latest browsers, you could consider using Cross Origin Resource Sharing.
As The Rook pointed out, you could also use Flash + crossdomain.xml. Or you could use Silverlight and its equivalent of crossdomain.xml. Both technologies allow you to communicate with JavaScript, so you just need to write a utility function and then normal JS code would work. I believe YUI already provides a Flash wrapper for this; check out YUI3 IO.
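Here is the mod_rewrite sketch promised above (it requires mod_proxy to be enabled; the path and backend host are placeholders):

# Forward same-origin /api/* requests to the backend, so the browser
# only ever talks to frontend.com.
RewriteEngine On
RewriteRule ^/api/(.*)$ http://backend.com/$1 [P]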
What do you recommend?
My recommendation is to create a proxy server, and use https throughout your website.
Apologies to all who attempted to answer my question. It proceeded under a false assumption about how the script tag hack works. The assumption was that one could simply append a script tag to the DOM and that the contents of that appended script tag would not be restricted by the same origin policy.
If I'd bothered to test my assumption before posting the question, I would've known that it's the source attribute of the appended tag that's unrestricted. JSONP takes this a step further by establishing a protocol that wraps traditional JSON web service responses in a callback function.
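For reference, the JSONP protocol on the wire looks something like this (a sketch; the names are made up):

<script>
  // The page defines a global callback for the service to invoke.
  function handleData(data) { console.log(data); }
</script>
<!-- The script tag's unrestricted src fetches cross-domain code... -->
<script src="http://backend.com/api?callback=handleData"></script>
<!-- ...and the server responds with a script: handleData({"foo": "bar"}); -->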
Regardless of how the script tag hack is used, however, there is no way to screen the response for malicious code since browsers execute whatever JavaScript is returned. And neither IE, Firefox nor Webkit browsers check SSL certificates in this scenario. Doug Crockford is, so far as I can tell, correct. There is no safe way to do cross domain scripting as of JavaScript 1.8.5.