Amazon Mechanical Turk - ExternalHit Sample - amazon

I'm trying to run the ExternalHit sample that comes with the Command Line Tools installer. For the .question file I have the following ...
<?xml version="1.0"?>
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
<ExternalURL>https://s3.amazonaws.com/MTurk_test/externalpage.htm?url=${helper.urlencode($urls)}</ExternalURL>
<FrameHeight>450</FrameHeight>
</ExternalQuestion>
I kept the input file the same as in the sample. When I load the HITs in either the sandbox or the standard Mechanical Turk interface, the input variable (i.e. the web page) does not display in the frame. Additionally, you can select a radio button but you cannot submit your answer.

You're running into two problems here:
Many sites refuse to be framed by a page from another origin (for example by sending an X-Frame-Options header), so the <iframe> in the resulting page is not going to display an arbitrary URL.
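Even if it could display an arbitrary URL, that URL would have to be served over HTTPS (SSL), because the page that embeds it is loaded over HTTPS and browsers block insecure content inside secure pages.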
So, you'll be able to load this (another Amazon URL using SSL):
https://s3.amazonaws.com/MTurk_test/externalpage.htm?url=https%3A%2F%2Fwww.mturk.com
But not this (Amazon URL without SSL):
https://s3.amazonaws.com/MTurk_test/externalpage.htm?url=http%3A%2F%2Fwww.mturk.com
And not an arbitrary URL (even with SSL):
https://s3.amazonaws.com/MTurk_test/externalpage.htm?url=https%3A%2F%2Fwww.google.com
So, my guess is that this template is badly outdated: it worked at some point in the past but is not compatible with the security restrictions of modern web browsers.
The best solution is simply to provide a link for the worker to click to visit the URL.
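The HTTPS requirement and the link fallback described above can be sketched in a few lines of JavaScript. This is only an illustration: the names isHttpsUrl, escapeHtml and buildWorkerLink are made up here and are not part of the Mechanical Turk sample, and the sketch assumes the URLs come from your own input file.

```javascript
// Returns true only for parseable absolute URLs that use the https: scheme.
function isHttpsUrl(candidate) {
  try {
    return new URL(candidate).protocol === "https:";
  } catch (err) {
    return false; // not a parseable absolute URL
  }
}

// Escapes the characters that matter inside HTML text and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: builds a plain link that opens the target in a new tab,
// which works whether or not the target allows framing.
function buildWorkerLink(targetUrl) {
  var safe = escapeHtml(targetUrl);
  return '<a href="' + safe + '" target="_blank" rel="noopener">' + safe + "</a>";
}
```

A page could call isHttpsUrl first to decide whether framing is even worth attempting, and emit buildWorkerLink(url) in every other case.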

Related

Why does the Foursquare API JS not work with HTTPS?

In a system I have to maintain (didn't build it, just inherited it) we have a Foursquare implementation that hasn't been used in quite a while. Trying to revive it failed, because our page is now loaded via HTTPS, which it didn't use to be.
We are using the "Save to Foursquare" button as well as the API request to retrieve the number of Check-ins. I already switched all the JS includes and intent links from http to https and at least now it shows the number and the button correctly.
However, I can't click the button and checking the browser's console I found that it added a script tag to the head of this page which tries to access http://platform.foursquare.com/js/modules/widgets.asyncbundle.js. The browser obviously blocks this, because it's not using HTTPS.
The file we are explicitly loading is https://platform.foursquare.com/js/widgets.js. It seems to me like this script is not reacting correctly to HTTP vs. HTTPS. There is probably a very simple solution to this, so what am I missing?
I don't know if you've tried it yet, but the Foursquare website says this on the matter:
Change the source of the JavaScript file to https://platform-s.foursquare.com/js/widgets.js
Add {"secure":true} to the global configuration block (window.___fourSq)
The same link (see below) has all the different ways to call the Save To Foursquare function using its .saveTo() function.
https://developer.foursquare.com/overview/widgets
I hope this information and these links help! Cheers.
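Put together, the two changes quoted above might look like the sketch below. It uses only the script URL and the secure flag mentioned in the answer; the way the script element is created is an assumption about how the page includes the widget, so adapt it to your own markup.

```javascript
// Use the browser's window when available; globalThis lets the snippet also run outside a browser.
var root = typeof window !== "undefined" ? window : globalThis;

// Keep any settings already present in the global configuration block and add the flag.
root.___fourSq = Object.assign({}, root.___fourSq, { secure: true });

// Load the widget script from the HTTPS host named in the answer (browser only).
if (typeof document !== "undefined") {
  var widgetScript = document.createElement("script");
  widgetScript.src = "https://platform-s.foursquare.com/js/widgets.js";
  widgetScript.async = true;
  document.head.appendChild(widgetScript);
}
```

The configuration block is set before the script is added so that the flag is already in place when the widget code starts running.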

Scraping adf faces oracle rich client

I am trying to scrape an Oracle ADF Faces rich client web page, but I am not having much luck. I log in automatically using the Node.js request module, but after that I can't get to any other page with request. I get stuck on redirects or on the loop script, or I simply don't get the information I expect.
I am using Wireshark to view every page and the way it is handled. I recreate each request to match the headers and even the size, but every time the framework denies me access.
Before you ask, it's legal and I am not breaking any terms of service. I'm just trying to make a web API to speed up a process. I have used PhantomJS with CasperJS but get stuck on AJAX calls that don't show on the page, and PHP cURL, but it's much easier with java.
Any suggestions are really really appreciated.
My bad on this one: Wireshark was displaying fields as truncated. If you want to see the full field, you need to right-click the packet and click Follow TCP Stream. Rich clients have very long POSTs generated by the framework behind the rich client, and it appears I was missing about half of them when I made the calls.

sending a message from a web app to an extension

I have an extension which provides a number of services to any web app that requires them. I had been assuming that a web app could use chrome.runtime.sendMessage(ext-id,message), but when I try, there is no sendMessage function on chrome.runtime.
Have I misunderstood where sendMessage can be used, and is there another technique that I can use to communicate from an arbitrary web app to my extension?
There are a few options.
First, http://developer.chrome.com/extensions/manifest/externally_connectable.html is the closest to how you're thinking about it right now. You're expecting to be able to add proprietary, Chrome-specific functionality to arbitrary web pages. externally_connectable will give you a limited version of this (see http://developer.chrome.com/extensions/messaging.html#external-webpage for an example), but only for specific web pages (e.g., *.yourdomain.com but not *.com).
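As a rough sketch of this first option, the snippet below wraps the web page call and the extension's listener in two small functions. The function names, the extension ID and the message shape are placeholders made up for illustration; the runtime object is passed in as a parameter (in a real page it would be chrome.runtime) so the page side can fail gracefully when the API is not exposed.

```javascript
// Assumed manifest.json entry in the extension (the domain is a placeholder):
//   "externally_connectable": { "matches": ["*://*.yourdomain.com/*"] }

// Web page side: returns false when the messaging API is not exposed to this page.
function sendToExtension(runtime, extensionId, message, onResponse) {
  if (!runtime || typeof runtime.sendMessage !== "function") {
    return false;
  }
  runtime.sendMessage(extensionId, message, onResponse || function () {});
  return true;
}

// Extension side (background script): answers messages that arrive from web pages.
function registerExternalListener(runtime) {
  runtime.onMessageExternal.addListener(function (request, sender, sendResponse) {
    // Hypothetical reply shape: echo the request back to the page.
    sendResponse({ ok: true, echoed: request });
  });
}
```

On a page that matches the manifest pattern, a call such as sendToExtension(window.chrome && window.chrome.runtime, "your-extension-id", { action: "ping" }, callback) would reach the listener; on any other page it simply returns false.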
Second, you can postMessage from your web page to a content script (see http://developer.chrome.com/extensions/content_scripts.html#host-page-communication), which can do anything a content script can. If you need chrome.* APIs at that point, you can message from the content script to your extension's page, which has access to any chrome.* APIs that it's asked for.
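A minimal sketch of this second option follows. The tag "my-web-app" and the function names are invented for the example; the window and runtime objects are passed in as parameters (window and chrome.runtime in practice) to keep the pieces easy to follow.

```javascript
// Page side: post a tagged message to the page's own window.
// "*" is used as the target origin because the message never leaves this window.
function postToContentScript(pageWindow, payload) {
  pageWindow.postMessage({ source: "my-web-app", payload: payload }, "*");
}

// Content script side: accept only messages sent by this page that carry our tag.
function isMessageFromPage(event, pageWindow) {
  return event.source === pageWindow &&
    !!event.data &&
    event.data.source === "my-web-app";
}

// Content script wiring: relay accepted messages to the rest of the extension.
function installRelay(pageWindow, runtime) {
  pageWindow.addEventListener("message", function (event) {
    if (isMessageFromPage(event, pageWindow)) {
      runtime.sendMessage(event.data.payload);
    }
  });
}
```

The source check matters because any frame or script can post messages to the window; without it the content script would relay messages it was never meant to see.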
Finally, depending on what your "number of services" actually is, you can always executeScript another script directly into a target webpage, which is similar to forcing the webpage to include it as if it were another <script> tag. (Only similar to, not identical to, because the injection typically happens after the page has loaded.)

how to check if my website is being accessed using a crawler?

How can I check whether a certain page is being accessed by a crawler or by a script that fires continuous requests?
I need to make sure that the site is only being accessed from a web browser.
Thanks.
This question is a great place to start:
Detecting 'stealth' web-crawlers
Original post:
This would take a bit to engineer a solution.
I can think of three things to look for right off the bat:
One, the user agent. If the spider is Google or Bing or anything else, it will identify itself.
Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.
Three, take note of what it's accessing and how regularly. If the content takes the average human X amount of seconds to view, then you can use that as a place to start when trying to determine if it's humanly possible to consume the data that fast. This is tricky: you'll most likely have to rely on cookies, because an IP address can be shared by multiple users.
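The first and third checks can be sketched as below. This is an illustration under stated assumptions, not a complete detector: the list of bot tokens is deliberately short, the function names are made up, and the client key is assumed to be something like a cookie value that your server already assigns.

```javascript
// Tokens that some well-known crawlers include in their User-Agent header (not exhaustive).
var DECLARED_BOT_PATTERN = /googlebot|bingbot|slurp|duckduckbot|baiduspider|yandexbot/i;

// Check one: does the client openly identify itself as a crawler?
function isDeclaredBot(userAgent) {
  return DECLARED_BOT_PATTERN.test(userAgent || "");
}

// Check three: remember request times per client key and flag clients that make
// more than maxRequests requests within windowMs milliseconds.
function createRateTracker(maxRequests, windowMs) {
  var hits = new Map();
  return function isTooFast(clientKey, nowMs) {
    var recent = (hits.get(clientKey) || []).filter(function (t) {
      return nowMs - t < windowMs;
    });
    recent.push(nowMs);
    hits.set(clientKey, recent);
    return recent.length > maxRequests;
  };
}
```

A server would call isDeclaredBot with the request's User-Agent header and isTooFast with the cookie value and Date.now(); the thresholds need tuning against how quickly a person can realistically read your pages.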
You can use the robots.txt file to block access to crawlers, or you can use JavaScript to detect the user agent and switch based on that. If I understood correctly, the first option is more appropriate, so:
User-agent: *
Disallow: /
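Save that as robots.txt at the site root, and well-behaved automated systems should stay away from your site. Note that robots.txt is only advisory: a crawler that chooses to ignore it can still request every page.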
I had a similar issue in my web application because I created some bulky data in the database for each user that browsed the site, and the crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers because I wanted my site indexed and found; I just wanted to avoid creating useless data and reduce the time taken to crawl.
I solved the problem the following ways:
First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (since 2.0), which indicates whether the browser is a search engine web crawler. You can access it from anywhere in the code:
ASP.NET C# code behind:
bool isCrawler = HttpContext.Current.Request.Browser.Crawler;
ASP.NET HTML:
Is crawler? = <%=HttpContext.Current.Request.Browser.Crawler %>
ASP.NET JavaScript:
<script type="text/javascript">
var isCrawler = <%=HttpContext.Current.Request.Browser.Crawler.ToString().ToLower() %>;
</script>
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but it may be useful in your case.
After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking a button. Some crawlers do process JavaScript, and they can be expected to trigger the onclick event of a button element, but not that of a normally non-interactive element such as a div. The following is the HTML / JavaScript code I used in my web application www.so-much-to-do.com to implement this feature:
<div
class="all rndCorner"
style="cursor:pointer;border-width:3px;border-style:groove;text-align:center;font-size:medium;font-weight:bold"
onclick="$TodoApp.$AddSampleTree()">
Please click here to create your own set of sample tasks to do
</div>
This approach has been working impeccably until now, although crawlers could be changed to be even more clever, maybe after reading this article :D

Why does new Facebook Javascript SDK not violate the "same origin policy"?

The new Facebook Javascript SDK can let any website login as a Facebook user and fetch data of a user...
So it will be, www.example.com including some Javascript from Facebook, but as I recall, that script is considered to be of the origin of www.example.com and cannot fetch data from facebook.com, because it is a violation of the "same origin policy". Isn't that correct? If so, how does the script fetch data?
From here: https://developer.mozilla.org/en/Same_origin_policy_for_JavaScript
The same origin policy prevents a document or script loaded from one origin from getting or setting properties of a document from another origin. This policy dates all the way back to Netscape Navigator 2.0.
and explained slightly differently here: http://docs.sun.com/source/816-6409-10/sec.htm
The same origin policy works as follows: when loading a document from one origin, a script loaded from a different origin cannot get or set specific properties of specific browser and HTML objects in a window or frame (see Table 14.2).
The Facebook script is not attempting to interact with script from your domain or to read DOM objects. It's just going to do its own post to Facebook. It gets your site name not by interacting with your page or with script from your site, but from the script itself, which is generated when you fill out the form to get the "like" button. I registered a site named "http://www.bogussite.com" and got the code to put on my website. The first thing in this code was
iframe src="http://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.bogussite.com&
so the script is clearly getting your site info from URL parameters hard-coded into the iframe's src.
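To make the point concrete, the site address can be read straight back out of that src value with the standard URL API. The function name below is made up for the example, and the URL is the part of the generated code quoted above, with the trailing truncated parameters left off.

```javascript
// Reads the "href" query parameter that the generated code hard-codes into the iframe src.
function extractLikedPage(iframeSrc) {
  return new URL(iframeSrc).searchParams.get("href");
}

var likedPage = extractLikedPage(
  "http://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.bogussite.com"
);
console.log(likedPage); // http://www.bogussite.com
```

Nothing here touches the embedding page: everything Facebook needs to know about the site travels in the query string.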
Facebook's website is far from alone in having you use scripts hosted on their servers. There are plenty of other scripts that work this way. All of the Google APIs, for example, including Google Gears, Google Analytics, etc., require you to use a script hosted on their server. Just last week, while I was trying to figure out how to do geolocation for our store finder for a mobile-friendly web app, I found a whole slew of geolocation services that had you use scripts hosted on their servers, rather than copying the script to your server.
I think, but am not sure, that they use the iframe method. At least the cross-domain receiver and XFBML stuff for canvas apps uses that. Basically, the JavaScript on your page creates an iframe within the facebook.com domain. That iframe then has permission to do whatever it needs with Facebook. Communication back to the parent can be done with one of several methods, for example the URL hash. But I'm not sure which method, if any, they use for that part.
If I recall, they use script tag insertion (the technique usually called JSONP). So when a JS SDK call needs to call out to Facebook, it inserts a <script src="http://graph.facebook.com/whatever?params...&callback=some_function"></script> tag into the current document. Then Facebook returns the data in JSON format as some_function({...}), where the actual data is inside the ... . This results in the function some_function being called in the origin of example.com using data from graph.facebook.com.
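The script tag insertion described above can be sketched generically as follows. This is not Facebook's actual SDK code: the function names are invented, and the endpoint in the usage note is a placeholder. jsonpRequest needs a browser (window and document), while buildJsonpUrl is plain URL handling.

```javascript
var callbackCounter = 0;

// Appends the query parameters and the name of the callback the server should wrap its data in.
function buildJsonpUrl(endpoint, params, callbackName) {
  var url = new URL(endpoint);
  Object.keys(params).forEach(function (key) {
    url.searchParams.set(key, params[key]);
  });
  url.searchParams.set("callback", callbackName);
  return url.toString();
}

// Browser only: registers a one-shot global callback, then inserts the <script> tag.
// The response is expected to be JavaScript of the form callbackName({...}).
function jsonpRequest(endpoint, params, onData) {
  var callbackName = "jsonp_cb_" + (callbackCounter++);
  var script = document.createElement("script");
  window[callbackName] = function (data) {
    delete window[callbackName];           // clean up the temporary global
    script.parentNode.removeChild(script); // and the inserted tag
    onData(data);
  };
  script.src = buildJsonpUrl(endpoint, params, callbackName);
  document.head.appendChild(script);
}
```

For example, jsonpRequest("https://graph.example.com/whatever", { id: "42" }, handler) would insert a tag whose src ends in ?id=42&callback=jsonp_cb_0. Because the browser loads scripts from any origin, the returned call runs in the embedding page and hands the data to handler.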
