I have a web application with some pretty intuitive URLs, so people have written Chrome extensions that use these URLs to make requests to our servers. Unfortunately, these extensions cause problems for us, hammering our servers, issuing malformed requests, etc., so we are trying to figure out how to block them, or at least make it difficult enough to craft requests to our servers that these extensions stop being used (we provide an API they should use instead).
We've tried adding some custom headers to requests and a junk JSON preamble to responses, but the extension authors have updated their code to match.
I'm not familiar with Chrome extensions, so what sort of access to the host page do they have? Can they call JavaScript functions on the host page? Is there a special header the browser includes to distinguish between host-page requests and extension requests? Can the host page inspect the list of extensions and deny certain ones?
Some options we've considered are:
Rate-limiting QPS by user, but the problem is that not all queries are equal, and extensions typically kick off several expensive queries that look like user-entered queries.
Restricting the amount of server time a user can use, but the problem is that users might hit this limit by just navigating around or running expensive queries several times.
Adding static custom headers/response text, but they've updated their code to mimic our code.
Figuring out some sort of token (probably cryptographic in some way) that we include in our requests and that the extension can't easily guess. We minify/obfuscate our JS, so we're OK with embedding it in the JS source code (since the variable name it would have would be hard to guess). A rough sketch of this idea follows below.
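Purely as an illustration of that last option: the header name X-App-Token, the fetchWithToken wrapper, and the server-injected window.__APP_TOKEN__ are all made-up names, and the server-side signing/validation is assumed.

    // Assumption: the server renders the page with a short-lived, signed token
    // (e.g. an HMAC over the session ID and a timestamp) injected into the page.
    var PAGE_TOKEN = window.__APP_TOKEN__;

    // Wrapper that attaches the token to every UI-originated request.
    // The server rejects requests with a missing, expired, or invalid token.
    function fetchWithToken(url, options) {
      options = options || {};
      options.headers = Object.assign({}, options.headers, {
        "X-App-Token": PAGE_TOKEN
      });
      return fetch(url, options);
    }

    // Example: fetchWithToken("/search?q=foo").then(function (res) { /* ... */ });

An extension running in the page can still read the token, so this only raises the bar; binding the token to the session cookie and rotating it would at least keep copied tokens from being replayed elsewhere.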
I realize this may not be a 100% solvable problem, but we hope either to gain the upper hand in combating it or to make scraping our UI hard enough that fewer people do it.
Welp, guess nobody knows. In the end we just sent a custom header and started tracking who wasn't sending it.
So I have to get a client's browser and OS name, but we don't want the user to be able to manipulate that information. Some websites suggest there is only one way to do it: the User-Agent request header.
Below are the links I've been through:
Retrieving Browser, OS and Device Type By Parsing User Agent
How to prevent user-agent to be changed by user
How do I prevent websites from detecting my OS? Which browser should I use?
So according to these, we can only do it with the help of the User-Agent, it is not difficult for a client to change it, and there is no way to detect whether a client has modified it. It also turns out that even multinationals like Amazon and Facebook rely on the User-Agent.
While learning about device fingerprinting I came across a JavaScript library called FingerprintJS, and it seems it doesn't rely on the User-Agent to find out the client's OS name: I tried it, and even after manipulating the User-Agent I got the original result. I am still trying to figure out exactly how it determines the OS and browser name. Even if the client can manipulate that too, is there still a way we can at least make it difficult for a client to fake their browser and OS?
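For what it's worth, fingerprinting libraries combine many signals rather than trusting any single one. Below is a minimal sketch of collecting a few browser-exposed values that don't come from the User-Agent string; this is an illustration, not a description of what FingerprintJS actually does, and the /signals endpoint is made up.

    // None of these signals are tamper-proof either; together they just make
    // faking the OS/browser consistently a bit harder.
    function collectSignals() {
      return {
        platform: navigator.platform || null,            // legacy but still widely exposed
        uaDataPlatform: navigator.userAgentData           // Chromium client hints, if present
          ? navigator.userAgentData.platform
          : null,
        hardwareConcurrency: navigator.hardwareConcurrency || null,
        maxTouchPoints: navigator.maxTouchPoints || 0,
        timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
        screen: screen.width + "x" + screen.height + "x" + screen.colorDepth
      };
    }

    // Send the signals to the server for a best-effort guess:
    // fetch("/signals", { method: "POST", body: JSON.stringify(collectSignals()) });

The server can then cross-check these against the claimed User-Agent and flag obvious contradictions, while accepting that a determined client can still spoof everything.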
You are not able to restrict the values that are sent with a request to your server. A user will always be able to use e.g. curl to send arbitrary headers, cookies, etc. You can make it more difficult to tamper with the values through some obscurity, but that does not make such a solution secure.
Device fingerprinting might help, but you will most probably get blocked by ad blockers as they target fingerprinting as well. Still, even if you do implement device fingerprinting and get more accurate data about the user's browser, the user still can tamper with requests and change that data.
I don't know what your requirements are, but normally you shouldn't be that concerned with the user's browser or OS.
As there's no guaranteed way of knowing the user's OS/browser (since the user is able to send anything with their request), the more important question to ask may be:
Why do you want to know the user's OS/browser?
This can help us find a better answer for your actual requirements.
For example, this might help: https://developer.mozilla.org/en-US/docs/Web/HTTP/Browser_detection_using_the_user_agent#considerations_before_using_browser_detection
One method I can think of is through a custom browser extension/plugin. You may even be able to use a browser API, depending on the target browser.
You would then craft a payload that computes the "client signature" out-of-band, outside the browser's standard request cycle, and produces a signed, self-validating hash, stored as a cookie (a minimal sketch follows below).
This would require some knowledge of the related layers involved.
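A minimal sketch of the "signed, self-validating hash stored as a cookie" part, assuming a Node.js server; the secret handling and cookie format here are illustrative, not a drop-in implementation.

    const crypto = require("crypto");

    // Assumption: a secret known only to the server.
    const SECRET = process.env.SIGNATURE_SECRET;

    // Produce "<value>.<hmac>" so tampering with the value is detectable.
    function signValue(value) {
      const mac = crypto.createHmac("sha256", SECRET).update(value).digest("hex");
      return value + "." + mac;
    }

    // Verify a cookie of the form "<value>.<hmac>"; returns the value or null.
    function verifyValue(cookie) {
      const i = cookie.lastIndexOf(".");
      if (i < 0) return null;
      const value = cookie.slice(0, i);
      const mac = cookie.slice(i + 1);
      const expected = crypto.createHmac("sha256", SECRET).update(value).digest("hex");
      const ok = mac.length === expected.length &&
        crypto.timingSafeEqual(Buffer.from(mac), Buffer.from(expected));
      return ok ? value : null;
    }

    // e.g. Set-Cookie: clientSig=<signValue(clientSignature)>; HttpOnly; Secure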
You are essentially talking about device fingerprinting.
While there are a vast number of approaches, you may not really want to maintain the overhead required, as it is generally done using multiple techniques, many of which rely on exploiting bugs in browsers, HTTP protocols, network routing analysis, and even clever targeting of numerous OS bugs and/or quirks.
A much simpler approach is to feed your user a hashed cookie, with a scheme to detect if it's been modified. That cookie, along with other authentication and verification mechanisms would be far simpler and may be enough for your purposes.
There are 3rd party APIs which provide such a service, if it's really mission critical.
Of course, philosophically speaking, whether or not you should be fingerprinting your users is really up to you and the expectations of your users.
But there you go, I hope that provides a broader view of what's involved.
How to find out my site is being scraped?
I've some points...
Network Bandwidth occupation, causing throughput problems (matches if proxy used).
When querying a search engine for keywords, new references appear to other, similar resources with the same content (matches if proxy used).
Multiple requests from the same IP.
High request rate from a single IP. (By the way: what is a normal rate?)
Headless or weird user agent (matches if proxy used).
Requesting with predictable (equal) intervals from the same IP.
Certain support files are never requested, e.g. favicon.ico, various CSS and JavaScript files (matches if proxy used).
The client's request sequence, e.g. the client accesses pages that are not directly accessible (matches if proxy used).
Would you add more to this list?
What points might fit/match if a scraper uses proxying?
As a first note, consider whether it's worthwhile to provide an API for bots in the future. If you are being crawled by another company etc., and it is information you want to provide to them anyway, it makes your website valuable to them. Creating an API would reduce your server load substantially and give you 100% clarity on who is crawling you.
Second, speaking from personal experience (I wrote web crawlers for quite a while), you can generally tell immediately by tracking which browser accessed your website. If they are using one of the automated tools, or one written from scratch in a development language, it will be uniquely different from your average user. Not to mention tracking the log file and updating your .htaccess to ban them (if that's what you are looking to do).
Other than that, it's usually fairly easy to spot: repeated, very consistent opening of pages.
Check out this other post for more information on how you might want to deal with them, also for some thoughts on how to identify them.
How to block bad unidentified bots crawling my website?
I would also add analysis of when requests by the same people are made. For example, if the same IP address requests the same data at the same time every day, it's likely the process is on an automated schedule, and hence is likely to be scraping.
Possibly add analysis of how many pages each user session has touched. For example, if a particular user on a particular day has browsed to every page on your site and you deem this unusual, then perhaps it's another indicator.
It feels like you need a range of indicators, and you need to score them and combine the scores to show who is most likely scraping.
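A minimal sketch of that scoring idea; the weights, thresholds, the stats fields, and blockOrCaptcha are placeholders for whatever per-client data and policy you actually have.

    // 'stats' is per-client (IP or session) request data you already collect.
    function scrapeScore(stats) {
      let score = 0;
      if (stats.requestsPerMinute > 60) score += 3;               // high request rate
      if (stats.intervalStdDevMs < 50) score += 3;                // suspiciously regular timing
      if (!stats.fetchedFavicon && !stats.fetchedCss) score += 2; // support files never requested
      if (stats.distinctPagesPerDay > 500) score += 2;            // crawled "everything" in a day
      if (stats.sameTimeOfDayEveryDay) score += 1;                // scheduled-job pattern
      return score;
    }

    // Example policy: review above one threshold, challenge or block above another.
    // if (scrapeScore(stats) >= 6) blockOrCaptcha(clientId);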
Like a lot of developers, I want to make JavaScript served up by Server "A" talk to a web service on Server "B", but I am stymied by the current incarnation of the same-origin policy. The most secure means of overcoming this (that I can find) is a server script that sits on Server "A" and acts as a proxy between it and "B". But if I want to deploy this JavaScript in a variety of customer environments (RoR, PHP, Python, .NET, etc.) and can't write proxy scripts for all of them, what do I do?
Use JSONP, some people say. Well, Doug Crockford pointed out on his website and in interviews that the script tag hack (used by JSONP) is an unsafe way to get around the same-origin policy. There's no way for the script being served by "A" to verify that "B" is who they say they are, or that the data "B" returns isn't malicious and won't capture sensitive user data on that page (e.g. credit card numbers) and transmit it to dastardly people. That seems like a reasonable concern, but what if I just use the script tag hack by itself and communicate strictly in JSON? Is that safe? If not, why not? Would it be any more safe with HTTPS? Example scenarios would be appreciated.
Addendum: Support for IE6 is required. Third-party browser extensions are not an option. Let's stick with addressing the merits and risks of the script tag hack, please.
Currently browser vendors are split on how cross-domain JavaScript should work. A secure and easy-to-use option is Flash's crossdomain.xml file. Most languages have cross-domain proxies written for them, and they are open source.
A more nefarious solution would be to use XSS the way the Samy worm spread. XSS can be used to "read" a remote domain using XMLHttpRequest. XSS isn't even required if the other domain has added a <script src="https://YOUR_DOMAIN"></script>. A script tag like this allows you to evaluate your own JavaScript in the context of another domain, which is identical to XSS.
It is also important to note that even with the restrictions of the same-origin policy you can get the browser to transmit requests to any domain; you just can't read the response. This is the basis of CSRF. You could write invisible image tags to the page dynamically to get the browser to fire off an unlimited number of GET requests. This use of image tags is also how an attacker exfiltrates document.cookie obtained via XSS on another domain. CSRF POST exploits work by building a form and then calling .submit() on the form object.
To understand the Same-Origin Policy, CSRF, and XSS better, you should read the Google Browser Security Handbook.
Take a look at easyXDM; it's a clean JavaScript library that allows you to communicate across the domain boundary without any server-side interaction. It even supports RPC out of the box.
It supports all 'modern' browsers, as well as IE6, with transit times < 15ms.
A common use case is to use it to expose an AJAX endpoint, allowing you to do cross-domain AJAX with little effort (check out the small sample on the front page).
What if I just use the script tag hack by itself and communicate strictly in JSON? Is that safe? If not, why not?
Let's say you have two servers - frontend.com and backend.com. frontend.com includes a <script> tag like this: <script src="http://backend.com/code.js"></script>.
When the browser evaluates code.js, it is considered a part of frontend.com and NOT a part of backend.com. So, if code.js contained XHR code to communicate with backend.com, it would fail.
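For context, here is a minimal sketch of how the script-tag hack (JSONP) gets around that restriction; the callback name and endpoint are illustrative.

    // On the frontend.com page: define a global callback, then point a script
    // tag at backend.com, naming that callback in the query string.
    function handleData(data) {
      console.log("got data from backend.com:", data);
    }

    var s = document.createElement("script");
    s.src = "http://backend.com/data?callback=handleData"; // illustrative endpoint
    document.head.appendChild(s);

    // backend.com responds with JavaScript rather than plain JSON, e.g.:
    //   handleData({"result": 42});
    // The browser executes that response with the full privileges of the
    // including page, which is exactly why Crockford calls the pattern unsafe.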
Would it be any more safe with HTTPS? Example scenarios would be appreciated.
If you just converted your <script src="https://backend.com/code.js"> to https, it would NOT be any more secure. If the rest of your page is http, an attacker could easily man-in-the-middle the page and change that https to http - or worse, include his own JavaScript file.
If you convert the entire page and all its components to https, it would be more secure. But if you are paranoid enough to do that, you should also be paranoid enough NOT to depend on an external server for your data. If an attacker compromises backend.com, he has effectively got enough leverage over frontend.com, frontend2.com, and all of your websites.
In short, https is helpful, but it won't help you one bit if your backend server gets compromised.
So, what are my options?
Add a proxy server to each of your client applications. You don't need to write any code; your web server can automatically do that for you. If you are using Apache, look up mod_rewrite.
If your users are using the latest browsers, you could consider using Cross Origin Resource Sharing (a small sketch follows this list).
As The Rook pointed out, you could also use Flash + Crossdomain. Or you could use Silverlight and its equivalent of Crossdomain. Both technologies allow you to communicate with javascript - so you just need to write a utility function and then normal js code would work. I believe YUI already provides a flash wrapper for this - check YUI3 IO
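A minimal sketch of the Cross Origin Resource Sharing option mentioned above, assuming a Node.js/Express backend; the origin, port, and route names are illustrative.

    // On the backend: explicitly allow the frontend's origin in the response.
    const express = require("express");
    const app = express();

    app.get("/api/data", (req, res) => {
      res.set("Access-Control-Allow-Origin", "https://frontend.example");
      res.json({ result: 42 });
    });

    app.listen(3000);

    // On the frontend page, an ordinary XHR/fetch to the backend now succeeds
    // because the browser sees the Access-Control-Allow-Origin header:
    // fetch("https://backend.example/api/data").then(r => r.json()).then(console.log);

As the answer notes, this only helps for browsers new enough to support it, so it doesn't cover the IE6 requirement in the question.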
What do you recommend?
My recommendation is to create a proxy server, and use https throughout your website.
Apologies to all who attempted to answer my question. It proceeded under a false assumption about how the script tag hack works. The assumption was that one could simply append a script tag to the DOM and that the contents of that appended script tag would not be restricted by the same origin policy.
If I'd bothered to test my assumption before posting the question, I would've known that it's the source attribute of the appended tag that's unrestricted. JSONP takes this a step further by establishing a protocol that wraps traditional JSON web service responses in a callback function.
Regardless of how the script tag hack is used, however, there is no way to screen the response for malicious code, since browsers execute whatever JavaScript is returned. And neither IE, Firefox, nor WebKit browsers check SSL certificates in this scenario. Doug Crockford is, so far as I can tell, correct. There is no safe way to do cross-domain scripting as of JavaScript 1.8.5.
Our team has built a web application using Ruby on Rails. It currently doesn't restrict users from making excessive login requests. We want to ignore a user's login requests for a while after she has made several failed attempts, mainly to defend against automated robots.
Here are my questions:
How to write a program or script that can make excessive requests to our website? I need it because it will help me to test our web application.
How to restrict a user who has made several unsuccessful login attempts within a period? Does Ruby on Rails have a built-in solution for identifying a requester and tracking whether she has made any recent requests? If not, is there a general way (not specific to Ruby on Rails) to identify a requester and keep track of her activities? Can I identify a user by IP address, cookies, or some other information I can gather from her machine? We also hope to distinguish normal users (who make infrequent requests) from automated robots (who make requests frequently).
Thanks!
One trick I've seen is including form fields on the login form that are made invisible to the user through CSS hacks.
Automated systems/bots will still see these fields and may attempt to fill them with data. If you see any data in that field, you immediately know it's not a legit user and can ignore the request.
This is not a complete security solution but one trick that you can add to the arsenal.
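A minimal sketch of that honeypot-field trick, assuming a Node.js/Express login handler; the field name, route, and responses are made up.

    const express = require("express");
    const app = express();
    app.use(express.urlencoded({ extended: false }));

    // The login form contains an extra field hidden from real users via CSS:
    //   <input type="text" name="website" style="display:none" autocomplete="off">
    // Humans never see or fill it; naive bots often do.
    app.post("/login", (req, res) => {
      if (req.body.website) {
        // Honeypot filled in: treat as a bot, log it, and bail out quietly.
        return res.status(400).send("Invalid request");
      }
      // ...continue with the normal credential check here...
      res.send("ok");
    });

    app.listen(3000);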
In regard to #1, there are many automation tools out there that can simulate high-volume posting to a given URL. Depending on your platform, something as simple as wget might suffice, or something as complex (relatively speaking) as a script that drives a user agent to post a given request multiple times in succession (again, depending on platform, this can be simple; also depending on your language of choice for task 1).
In regard to #2, consider first the lesser issue of someone just firing off multiple attempts manually. Such instances usually share a session (that being the actual web server session); you should be able to track failed logins based on these session IDs and force an early failure if the volume of failed attempts breaks some threshold. I don't know of any plugins or gems that do this specifically, but even if there isn't one, it should be simple enough to create a solution.
If session ID does not work, then a combination of IP and UserAgent is also a pretty safe means, although individuals who use a proxy may find themselves blocked unfairly by such a practice (whether that is an issue or not depends largely on your business needs).
If the attacker is malicious, you may need to look at using firewall rules to block their access, as they are likely going to: a) use a proxy (so IP rotation occurs), b) not use cookies during probing, and c) not play nice with UserAgent strings.
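The question is about Rails, but as a language-agnostic illustration, here is a minimal in-memory sketch of the threshold idea keyed by session ID (or IP + User-Agent as a fallback); the limits are arbitrary.

    // key -> { count, firstAt }; in production this would live in a shared store.
    const failures = new Map();

    const MAX_FAILURES = 5;
    const WINDOW_MS = 15 * 60 * 1000; // 15 minutes

    function recordFailure(key) {
      const now = Date.now();
      const entry = failures.get(key);
      if (!entry || now - entry.firstAt > WINDOW_MS) {
        failures.set(key, { count: 1, firstAt: now });
      } else {
        entry.count += 1;
      }
    }

    function isBlocked(key) {
      const entry = failures.get(key);
      return !!entry &&
        entry.count >= MAX_FAILURES &&
        Date.now() - entry.firstAt <= WINDOW_MS;
    }

    // In the login handler: reject early if isBlocked(key), otherwise check the
    // credentials and call recordFailure(key) whenever the password is wrong.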
RoR provides means for testing your application, as described in A Guide to Testing Rails Applications. A simple solution is to write a test containing a loop that sends 10 (or whatever value you define as excessive) login requests. The framework provides means for sending HTTP requests or faking them.
Not many people will abuse your login system, so just remembering the IP addresses of failed logins (for an hour, or whatever period you think is sufficient) would be sufficient and not too much data to store. Unless some hacker has access to a great many IP addresses... but in such situations you'd need more serious security measures, I guess.
If the owner of a web site wants to track who their users are as much as possible, what things can they capture (and how). You might want to know about this in order to capture information on a site you create or, as a user, to prevent a site from capturing data on you.
Here is a starting list, but I'm sure I have missed some important ones:
Referrer (what web page had the link you followed to get here). This is an HTTP header.
IP address of the machine you are browsing from. This comes from the network connection itself (and sometimes from proxy-added headers) rather than from a header you send.
User Agent (what browser you are using). This is an HTTP header.
Cookie placed on a previous visit. This is an HTTP header, available only if a cookie was placed earlier and was not deleted by the user.
Flash Cookie placed on a previous visit. Some users turn off cookies, but very few know how to turn off Flash cookies. Works like a normal cookie although it depends on Flash.
Web Bugs. Place something small (like a transparent single-pixel GIF) on the page that's served up from a 3rd party. Some third parties (such as DoubleClick) will have their own cookies and can correlate with other visits the user makes (for a fee!).
Those are the common ones I think of, but there have to be LOTS of unusual ones. For instance, this:
Time on the user's clock. Use JavaScript to transmit it.
... which I had never heard of before reading it here.
ADDED LATER (after reading this):
Please try to put just ONE item per answer, then we can use voting up to sort out the better/more-interesting ones. The list below is probably less effective.
Ah well... NEXT time I ask a question like this I'll set it up better.
And here are some of the best answers I got:
James points out that IE transmits the .NET framework version.
AviewAnew points out that one can find what sites you have visited.
Mecki points out that Screen Resolution can be determined.
Mecki also points out that any auto-fill information your browser has cached can be determined, by creating a hidden field, then reading it with JavaScript.
jjrv points out that Flash can list the fonts on the user's machine.
Kent points out that you can find out what websites a person has visited.
Silver Dragon points out you can determine the location of the mouse within the browsing window using Flash and AJAX.
Jim points out that you can tell what language the user has configured in their browser from a HTTP header.
Jim also mentions that you can detect whether people are using Greasemonkey or something similar to modify the page.
Modifications to your original:
Referrer: can be suppressed (I think it's an option in some browsers).
IP address: only avoidable with a proxy (JavaScript can contravene this, however, with smart workarounds).
User Agent: unreliable, easily forged.
Cookie: and that's assuming it was not wiped by browser closure (session cookie) and the cookie is in the same domain/path.
The real nasty ones are
Using JavaScript to probe your network/LAN
Using JavaScript to access your firewall from behind the firewall and adjust its settings (no joke)
Using the "visited link" feature to determine which of a list of URLs has been visited (deep history probing!)
Goodness knows what if the user has Windows/IE/ActiveX
There's a header (such as Via or X-Forwarded-For) that can include information about a proxy server the user is using, and that can also include the user's IP address (in which case the other IP is the proxy's).
Screen resolution, operating system, color depth, size of your taskbar (compare maximum and current resolution), whether Java is enabled, anti-aliased fonts, plugins installed - all via JavaScript.
A Java applet can give you a bunch of information as well, but I don't know what.
Sites you've visited
Details of your local network, such as active hosts and web servers. The paper also outlines drive-by printing and drive-by router modification.
And this is all assuming the attacker doesn't pull off arbitrary code execution
JavaScript can capture more than just the time; screen resolution (plus color depth) is one example.
See Getting Screen Resolution with JS
Everything JS can capture can be transmitted using AJAX without the user performing any interaction. Other examples are (not all will work in every browser):
It can look into your browser history, e.g. what URL your browser would go to if you hit back or forward.
The language of your browser. (Note: usually the HTTP request will also contain a list of preferred languages for the page you request; however, that list is user-editable in the preferences of many browsers, while JS can find out which language translation your browser is actually using in its interface.)
If your browser auto fills form fields (e.g. e-mail, username, etc.), JS can actually already read what your browser entered into the fields before you submitted the form (thus it can even read what your browser pre-filled there, even if you never submit the form at all).
A Java applet could also gather some information and transmit it, though there is not much you wouldn't already get elsewhere. Since it's easy to get a visitor's IP, it's possible to find out which provider they're using (looking up the IP at regional registries such as ARIN for North America or RIPE for Europe), and there are services that translate IPs to countries, so it's possible to find out where the user is most likely located.
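A minimal sketch of the kind of JavaScript collection described above, transmitted without any user interaction; the /collect endpoint is made up.

    // Gather a few of the values mentioned above and post them silently.
    var info = {
      screenSize: screen.width + "x" + screen.height,
      colorDepth: screen.colorDepth,
      uiLanguage: navigator.language,                   // the browser's UI/preferred language
      timezoneOffsetMinutes: new Date().getTimezoneOffset(),
      localTime: new Date().toString()
    };

    var xhr = new XMLHttpRequest();
    xhr.open("POST", "/collect", true);                 // hypothetical endpoint
    xhr.setRequestHeader("Content-Type", "application/json");
    xhr.send(JSON.stringify(info));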
Some additional info, that might be of interest:
Using the IP address, one can resolve the hostname, the network provider/organization the IP belongs to, and a rough geographic location.
Using the referrer, the list of queries a given client makes, and a reliable cookie mechanism, one can reconstruct the path the visitor takes (even clickthroughs to other sites, with AJAX and/or a forwarder page).
Using Flash in combination with AJAX, the mouse location within the browsing window can be captured.
The User Agent might contain information regarding the operating system, installed .NET frameworks, and other curiosities.
.NET framework versions are transmitted in IE, in the User Agent.
Flash can give you a list of the fonts on the user's machine, among other things. JavaScript can send information when the mouse stops over an ad without clicking it. You can also get the window size, whether the site is open in a frame, and whether popups or specific plugins have been blocked; probing for JavaScript features can tell whether the User-Agent header is correct or faked...
If you're concerned about your personal security (I'm not sure if that's what you're really getting after, so my apologies if this is misguided), you can always use a Tor network. If you use Firefox, you can use Torbutton for one click enabling. It has the benefit (drawback, to some), of disabling Flash because it's otherwise impossible to protect against Flash information leaks.
You can usually determine which language the user speaks through the Accept-Language HTTP header.
You can determine whether certain applications and browser plugins are installed by looking at the Accept HTTP header.
Browser version/patchlevel and .NET framework version through the User-Agent HTTP header.
Your ISP/Employer and geographical location through IP address.
Whether or not you have visited particular URLs through CSS and/or timing load events. If a particular website has user-specific URIs, this could disclose whether you are a certain user on that site or not.
Which fonts are available through measuring ems and/or Flash.
Screen resolution, window size, timezone through JavaScript.
Where you move your mouse and what you type, through JavaScript. For instance, you can see what people type into text boxes even if they don't hit submit (a small sketch follows this list).
Many UserJS/Greasemonkey scripts leak information (e.g. if you filter out certain people, the sites it is configured for may be able to find out who).
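A minimal sketch of the text-box point above, i.e. reporting what is typed even if the form is never submitted; the field id and endpoint are illustrative.

    // Watch a text box and report its contents as the user types,
    // regardless of whether the surrounding form is ever submitted.
    var box = document.getElementById("comment");       // illustrative field

    box.addEventListener("input", function () {
      var xhr = new XMLHttpRequest();
      xhr.open("POST", "/typed", true);                 // hypothetical endpoint
      xhr.setRequestHeader("Content-Type", "application/json");
      xhr.send(JSON.stringify({ value: box.value }));
    });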
Whether the browser supports JS
Whether the browser supports Flash
Operating system platform
Screen resolution
Whether CSS is supported
Whether tables are supported
I need to dig up the link, but if the user is using IE, with common software titles installed, determining which ones are installed is possible.
As far as I know, it's possible to get clipboard data via JavaScript. I'm not sure how possible it is by default these days, but it was all the rage not long ago. I do believe IE still allows it.
People have a habit of leaving very important data in their clipboard, so this is pretty bad.
Late to the party here, but a website can also scan your ports to find out what software you are running!