How to find out my site is being scraped?
I've some points...
Network Bandwidth occupation, causing throughput problems (matches if proxy used).
When querting search engine for key words the new referrences appear to other similar resources with the same content (matches if proxy used).
Multiple requesting from the same IP.
High requests rate from a single IP. (by the way: What is a normal rate?)
Headless or weird user agent (matches if proxy used).
Requesting with predictable (equal) intervals from the same IP.
Certain support files are never requested, ex. favicon.ico, various CSS and javascript files (matches if proxy used).
The client's requests sequence. Ex. client access not directly accessible pages (matches if proxy used).
Would you add more to this list?
What points might fit/match if a scraper uses proxying?
As a first note; consider if its worthwhile to provide an API for bots for the future. If you are being crawled by another company/etc, if it is information you want to provide to them anyways it makes your website valuable to them. Creating an API would reduce your server load substantially and give you 100% clarity on people crawling you.
Second, coming from personal experience (I created web-crawls for quite a while), generally you can tell immediately by tracking what the browser was that accessed your website. If they are using one of the automated ones or one out of a development language it will be uniquely different from your average user. Not to mention tracking the log file and updating your .htaccess with banning them (if that's what you are looking to do).
Its usually other then that fairly easy to spot. Repeated, very consistent opening of pages.
Check out this other post for more information on how you might want to deal with them, also for some thoughts on how to identify them.
How to block bad unidentified bots crawling my website?
I would also add analysis of when the requests by the same people are made. For example if the same IP address requests the same data at the same time every day, it's likely the process is on an automated schedule. Hence is likely to be scraping...
Possible add analysis of how many pages each user session has impacted. For example if a particular user on a particular day has browsed to every page in your site and you deem this unusual, then perhaps its another indicator.
It feels like you need a range of indicators and need to score them and combine the score to show who is most likely scraping.
I have been looking around SO and other on-line resources but cant seem to locate how this is done. I was wondering how things like magnet links worked on torrent website. They automatically open up and application and pass the appropriate params. I was wondering how could I create one to send a custom program params from the net?
Thanks
s654m
I wouldn't say this is an answer, but it is actually too long for a comment to fit.
Apps tend to register as authorities that can open a specific scheme. I don't know how it's done in desktop apps (especially because depending on each OS, it will vary), but on Android you can catch schemes or base urls by Intent Filters.
The way it works (and I'm pretty sure the functionality is cross-OS) is:
Your app tells the system it can "read" a specific scheme or base url (it could be magnet:// or even http://www.twitter.com/).
When you try to open a URI (Uniform resource identifier, a supergroup that can contain URLs), the system searches for any application that was registered for that kind of URI. I guess it runs from more specific and complete formats to the base. So for instance, this tweet: https://twitter.com/korcholis/status/491724155176222720 may be traced in this order:
https://twitter.com/korcholis/status/491724155176222720 Oh, no registrar? Moving on
https://twitter.com/korcholis/status Nothing yet? Ok
https://twitter.com/korcholis Nnnnnnope?
https://twitter.com Anybody? Ah, you, Totally random name for a Twitter Client know how to handle these links? Then it's yours
This random twitter client gets the full URI and does something accordingly.
As you see, nobody had a chance to track https://, since another application caught the URI before them. In this case, nobody could be your browsers.
It also defines, somehow, a default value. This is the true key why browsers tend to battle to be your default browser of choice. This just tells you they want to be the default applications that catch http://, https:// and probably some more.
The true wonder here is that, as long as there's an app that catches a scheme, you can set the one you want. For instance, it's a common practice that apps from the same developer contain the same schemes, in case the developer wants to share tasks between them. This ensures the user will have to use a group of apps. So, one app can just offer data such as:
my-own-scheme://user/12
While another app is registered to get links that start with
my-own-scheme://
So, if you want to make your own schemes, it's ok, as long as they don't collide with other's. And if you want to read other's schemes, well, that's up to you to search for that. See? This is not a real answer, but I hope it removes almost all doubt.
I have a web application that has some pretty intuitive URLs, so people have written some Chrome extensions that use these URLs to make requests to our servers. Unfortunately, these extensions case problems for us, hammering our servers, issuing malformed requests, etc, so we are trying to figure out how to block them, or at least make it difficult to craft requests to our servers to dissuade these extensions from being used (we provide an API they should use instead).
We've tried adding some custom headers to requests and junk-json-preamble to responses, but the extension authors have updated their code to match.
I'm not familiar with chrome extensions, so what sort of access to the host page do they have? Can they call JavaScript functions on the host page? Is there a special header the browser includes to distinguish between host-page requests and extension requests? Can the host page inspect the list of extensions and deny certain ones?
Some options we've considered are:
Rate-limiting QPS by user, but the problem is not all queries are equal, and extensions typically kick off several expensive queries that look like user entered queries.
Restricting the amount of server time a user can use, but the problem is that users might hit this limit by just navigating around or running expensive queries several times.
Adding static custom headers/response text, but they've updated their code to mimic our code.
Figuring out some sort of token (probably cryptographic in some way) we include in our requests that the extension can't easily guess. We minify/obfuscate our JS, so are ok with embedding it in the JS source code (since the variable name it would have would be hard to guess).
I realize this may not be a 100% solvable problem, but we hope to either give us an upper hand in combatting it, or make it sufficiently hard to scrape our UI that fewer people do it.
Welp, guess nobody knows. In the end we just sent a custom header and starting tracking who wasn't sending it.
I'm thinking about adding another static server to a web app, so I'd have static1.domain.tld and static2.domain.tld.
The point would be to use different domains in order to load static content faster ( more parallel connections at the same time ), but what 'troubles' me is "how to get user's browser to see static1.domain.tld/images/whatever.jpg and static2.domain.tld/images/whatever.jpg as the same file" ?
Is there a trick to accomplish this with headers or I'll have to define which file is on which server?
No, there's no way to tell the browser that two URLs are the same -- the browser caches by full URL.
What you can do is make sure you always use the same url for the same image. Ie. all images that start with A-M go on server 1, N-Z go on server 2. For a real implementation, I'd use a hash based on the name or something like that, but there's probably libraries that do that kind of thing for you.
You need to have both servers able to respond to requests sent to static.domain.tld. I've seen a number of ways of achieving this, but they're all rather low level. The two I'm aware of:
Use a DNS round-robin so that the mapping of hostnames to IP addresses changes over time; very large websites often use variations on this so that content is actually served from a CDN closer to the client.
Use a hacked router config so that an IP address is answered by multiple machines (with different MAC addresses); this is very effective in practice, but requires the machines to be physically close.
You can also do the spreading out at the "visible" level by directing to different servers based on something that might as well be random (e.g., a particular bit from the MD5 hash of the path). And, best of all, all of these techniques use independent parts of the software stack to work; you can use them in any combination you want.
This serverfault question will give you a lot of information:
Best way to load balance across multiple static file servers for even an bandwidth distribution?
If the owner of a web site wants to track who their users are as much as possible, what things can they capture (and how). You might want to know about this in order to capture information on a site you create or, as a user, to prevent a site from capturing data on you.
Here is a starting list, but I'm sure I have missed some important ones:
Referrer (what web page had the link you followed to get here). This is a HTTP header.
IP Address of the machine you are browsing from. This is available with the HTTP headers.
User Agent (what browser you are using). This is a HTTP header.
Cookie placed on a previous visit. This is a header, available only if a cookie was placed earlier and was not deleted by the user.
Flash Cookie placed on a previous visit. Some users turn off cookies, but very few know how to turn off Flash cookies. Works like a normal cookie although it depends on Flash.
Web Bugs. Place something small (like a transparent single-pixel GIF) on the page that's served up from a 3rd party. Some third parties (such as DoubleClick) will have their own cookies and can correlate with other visits the user makes (for a fee!).
Those are the common ones I think of, but there have to be LOTS of unusual ones. For instance, this:
Time on the user's clock. Use JavaScript to transmit it.
... which I had never heard of before reading it here.
ADDED LATER (after reading this):
Please try to put just ONE item per answer, then we can use voting up to sort out the better/more-interesting ones. The list below is probably less effective.
Ah well... NEXT time I ask a question like this I'll set it up better.
And here are some of the best answers I got:
James points out that IE transmits the .NET framework version.
AviewAnew points out that one can find what sites you have visited.
Mecki points out that Screen Resolution can be determined.
Mecki also points out that any auto-fill information your browser has cached can be determined, by creating a hidden field, then reading it with JavaScript.
jjrv points out that Flash can list the fonts on the user's machine.
Kent points out that you can find out what websites a person has visited.
Silver Dragon points out you can determine the location of the mouse within the browsing window using Flash and AJAX.
Jim points out that you can tell what language the user has configured in their browser from a HTTP header.
Jim also mentions that you can detect whether people are using Greasemonkey or something similar to modify the page.
Modifications to your original:
can be escaped ( i think its an option in some browsers )
only avoidable with a proxy ( javascript can contravene this however with smart lookaround )
is unreliable, easily forged.
And assuming it was not wiped by browser closure ( session cookie ) and cookie is in the same domain/path
The real nasty ones are
Using javascript to probe your network/lan
Using javascript to access your firewall from behind the firewall and adjust its settings ( no joke )
Using the feature of the "visited link" to determine which of a list of urls have been visited. ( deep history probing ! )
Goodness knows what if the user has Windows/IE/ActiveX
There's a header that can include information about a proxy server the user is using, and that can also include the user's IP address (in which case the other IP is the one of the proxy)
Screen Resolution, Operating System, Color Depth, size of your taskbar (compare max and current resolution), if Java is enabled, Anti-Aliasing Fonts, Plugins Installed all via Javascript
A Java applet can give you a bunch of information as well, but I don't know what.
Sites you've visited
Details of your local network such as active hosts, web servers. Paper Also outlines drive-by printing, drive-by router modification
And this is all assuming the attacker doesn't pull off arbitrary code execution
Javascript can get more information than just time. E.g. screen resolution (+ color depth) being one of them.
See Getting Screen Resolution with JS
Everything JS can capture, can be transmitted using AJAX without the user performing any interaction. Other examples are (not all will work in every browser):
It can look into your browser history, e.g. what URL your browser would go if you hit back or forward.
The language of your browser (Note: usually the HTTP request will also contain a list of preferred languages for the page you request. However this list is user editable in the prefs of many browser, while JS can actually find out what the language translation your browser is using in the interface)
If your browser auto fills form fields (e.g. e-mail, username, etc.), JS can actually already read what your browser entered into the fields before you submitted the form (thus it can even read what your browser pre-filled there, even if you never submit the form at all).
A Java applet could also gather some information and transmit it, though there is not much information you wouldn't already get elsewhere. Since it's easy to get the IP of a visitor, it's possible to find out which online service he's using (looking up the IP at address services like IANA for USA or RIPE for Europe and so on) and there are services that translate IPs to country, so it's possible to find out where the user most likely is currently located.
Some additional info, that might be of interest:
Using the ip address, one can resolve the hostname, net provider / organization the IP belongs to, and rough geographic location.
Using the referer, the list of queries a specified client makes, and a reliable cookie mechanism, one can resolve the path the visitor makes (even clickthroughs to other sides, with AJAX and/or a forwarder page)
Using flash, with a combination of AJAX, the mouse location within the browsing window can be captured
The User Agent might contain information regarding operation system, installed .NET frameworks, and other curiosities
.NET framework versions are transmitted in IE, in the User Agent.
Flash can give you a list of fonts on the user's machine among other things. Javascript can send information when the mouse stops over an ad without clicking it. You can also get the window size, whether the site is open in a frame, if popups or specific plugins have been blocked, looking for Javascript features can tell if the user agent header is correct or faked...
If you're concerned about your personal security (I'm not sure if that's what you're really getting after, so my apologies if this is misguided), you can always use a Tor network. If you use Firefox, you can use Torbutton for one click enabling. It has the benefit (drawback, to some), of disabling Flash because it's otherwise impossible to protect against Flash information leaks.
You can usually determine which language the user speaks through the Accept-Language HTTP header.
You can determine whether certain applications and browser plugins are installed by looking at the Accept HTTP header.
Browser version/patchlevel and .NET framework version through the User-Agent HTTP header.
Your ISP/Employer and geographical location through IP address.
Whether or not you have visited particular URLs through CSS and/or timing load events. If a particular website has user-specific URIs, this could disclose whether you are a certain user on that site or not.
Which fonts are available through measuring ems and/or Flash.
Screen resolution, window size, timezone through JavaScript.
Where you move your mouse and keystrokes through JavaScript. For instance, you can see what people type into text boxes even if they don't hit submit.
Many UserJS/Greasemonkey scripts leak information (e.g. if you filter out certain people, the sites it is configured for may be able to find out who).
Can the browser support JS
Can the browser support flash
Operating system platform
Screen resolution
Supports CSS
Supports tables
I need to dig up the link, but if the user is using IE, with common software titles installed, determining which ones are installed is possible.
As far as I know, it's possible to get clipboard data via javascript. Not sure how possible it is by default these days, but it was all the rage not long ago. I do believe IE still allows it.
People have a habit of leaving very important data in their clipboard, so this is pretty bad.
late to the party here, the website can also scan your ports, to find what software you are running!