I'm not quite sure whether this is the right forum for my question. I'm analyzing web server logs in both Apache and IIS log formats, and I want to find evidence of automated browsing (e.g. web robots, spiders, bots). I used the Python package robot-detection 0.2.8 to detect robots in my log files, but there may be other robots (automated programs) that have traversed the site which robot-detection cannot identify.
So are there any specific clues that can be found in log files, i.e. actions that software performs but human users do not?
Do they follow a specific navigation pattern?
I also saw some requests for favicon.ico. Does that indicate automated browsing?
I found this article with some valuable points.
The article on how to identify robots has some good information. Other things you might consider:
If you see a request for an HTML page, but it isn't followed by requests for the images or script files that the page uses, it's very likely that the request came from a crawler. If you see lots of those from the same IP address, it's almost certainly a crawler. It could be the Lynx browser (text only), but it's more likely a crawler.
It's pretty easy to spot a crawler that scans your entire site very quickly. But some crawlers go more slowly, waiting 5 minutes or more between page requests. If you see multiple requests from the same IP address, spread out over time but at very regular intervals, it's probably a crawler.
Repeated 403 (Forbidden) entries in the log from the same IP. It's rare that a human will suffer through more than a handful of 403 errors before giving up. An unsophisticated crawler will blindly try URLs on the site, even if it gets dozens of 403s.
Repeated 404s from the same IP address. Again, a human will give up after some small number of 404s. A crawler will blindly push on ... "I know there's a good URL in here somewhere."
A user-agent string that isn't one of the major browsers' agent strings. If the user-agent string doesn't look like a browser's user agent string, it's probably a bot. Note that the reverse isn't true; many bots set the user agent string to a known browser user agent string.
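As a rough sketch of how you might automate some of those checks in Python (assuming a combined-format Apache access log named access.log; the error threshold and the "looks like a browser" test are arbitrary illustrations, not proven rules), you could count 403/404 responses per IP and collect user-agent strings that don't look like a mainstream browser:

    # Sketch: flag IPs with many 403/404 responses or a non-browser user agent.
    import re
    from collections import defaultdict

    # Apache "combined" format: ip ident user [time] "request" status size "referer" "agent"
    LINE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    )
    BROWSER_HINTS = ("Mozilla", "Opera")   # crude heuristic; most real browser UAs start with these

    errors = defaultdict(int)       # ip -> count of 403/404 responses
    odd_agents = defaultdict(set)   # ip -> user agents that don't look like a browser

    with open("access.log") as fh:
        for line in fh:
            m = LINE.match(line)
            if not m:
                continue
            if m["status"] in ("403", "404"):
                errors[m["ip"]] += 1
            if not m["agent"].startswith(BROWSER_HINTS):
                odd_agents[m["ip"]].add(m["agent"])

    suspects = set(errors) | set(odd_agents)
    for ip in sorted(suspects, key=lambda i: -errors[i]):
        if errors[ip] >= 20 or odd_agents.get(ip):
            print(ip, errors[ip], "errors; suspect agents:", odd_agents.get(ip, set()))

The timing heuristic can be bolted on the same way: parse the timestamps per IP and flag addresses whose inter-request intervals are suspiciously regular.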
I have a WordPress site running on IIS which returns a hard 403 response for certain user-agent strings (IE 6-10, and Firefox 33 on Mac and Windows). After some poking around by manually changing the user-agent string, I determined that the strings Trident/4, Trident/5, and Firefox/3 all cause the site to exhibit this behavior. There may be other combinations, but clearly something is going on at either the code or the IIS level.
I've scanned the code at a high level and found some mentions of both Firefox and Trident, all related to user-agent sniffing, but they appear to be in core WordPress files and not app-specific code. I've been searching all afternoon and the only suggestions I can find tell the user to adjust the "directory browsing" settings in web.config. However, I can replicate the behavior by directly accessing a static asset such as a CSS file. That tells me it is not related to directory browsing, and probably not to application code either.
Can anyone offer insight into what might be happening here? To head off questions:
We just noticed this behavior a few weeks ago, unsure of how long it's been going on.
I'll be checking access/error logs as soon as I can get them.
EDIT
Turns out that the previous developers had added some very specific URL rewriting rules for the site. They were explicitly returning a 403 for any user agent matching the patterns I listed above, along with a few other generic patterns and some specific bot names. I knew it had to be something in the web server... we just had to poke around in IIS long enough to find them.
I have in my main website root the file...
lib.php
So hackers keep hitting my website with different IP addresses, different operating systems, different everything. The page is redirected to our 404 error page, and this 404 page tracks visitors using standard visitor tracking analytics to allow us to see problems as they arise.
Below is an example of the landing pages the hackers hit, as shown in analytics, except that I get about 200 hits per hour. Each link is slightly different because they use a variable as the page URL to go to.
mysite.com/lib.php?id=zh%2F78jQrm3qLoE53KZd2vBHtPFaYHTOvBijvL2NNWYE%3D
mysite.com/lib.php?id=WY%2FfNHaB2OBcAH0TcsAEPrmFy1uGMHgxmiWVqT2M6Wk%VD
mysite.com/lib.php?id=WY%2FfNHaB2OBcAH0TcsAEPrmFy1uGMHgxmiWVqJHGEWk%T%
mysite.com/lib.php?id=JY%2FfNHaB2OBcAH0TcsAEPrmFy1uGMHgxmiWVqT2MFGk%BD
I do not think I even need the file http://www.mysite.com/lib.php
Do I need it? When I visit mysite.com/lib.php it is redirected to my custom 404 page.
What is the best way to stop this? I am thinking of using .htaccess, but I am not sure of the best setup.
This is most probably part of the Asprox botnet.
http://rebsnippets.blogspot.cz/asprox
The key thing is to change your password and stop using the FTP protocol to access your privileged accounts.
I need to know how to prevent repetitive file downloads using .htaccess, or if not via .htaccess then by some other method.
A site I maintain had over 9,000 hits on a single PDF file, accounting for over 80% of the site's total bandwidth usage, and I believe that most of the hits were from the same IP address. I've banned the IP, but that's obviously not an effective solution because there are always proxies and besides, I can only do that after the fact.
So what I want to do is cap the number of times a single IP can attempt to download a designated file or file type over a given period of time. Can this be done with .htaccess? If not, what other options do I have?
EDIT: The suggestion that I redirect requests to a server-side script that would track requests by IP via database sounds like a good option. Can anyone recommend an existing script or library?
If you have some server-side code stream out your files, you have the opportunity to control what's being sent. I'm not aware of a .htaccess solution (which doesn't mean there isn't one).
My background is in Microsoft products, so I'd write a bit of ASP.NET code that would accept the filename as a parameter and stream back the result. Since it would be my code, I could easily access a database to track which IP I was serving, how often the file was sent, and so on.
This could easily be done with any server-side technology - PHP, etc.
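As a rough illustration of the same idea in Python rather than ASP.NET, the sketch below streams the file through a small WSGI script and keeps per-IP counts in SQLite. The file path, database name, and limits are placeholders, and in production the script would sit behind Apache or IIS rather than wsgiref's test server.

    # Sketch only: serve a protected file through a script and refuse an IP
    # that has exceeded a download quota. Paths and limits are placeholders.
    import sqlite3
    import time
    from wsgiref.simple_server import make_server

    DB = sqlite3.connect("downloads.db", check_same_thread=False)
    DB.execute("CREATE TABLE IF NOT EXISTS hits (ip TEXT, ts REAL)")
    LIMIT, WINDOW = 5, 3600          # at most 5 downloads per IP per hour
    PDF_PATH = "files/report.pdf"    # the file being protected (placeholder)

    def app(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        now = time.time()
        DB.execute("DELETE FROM hits WHERE ts < ?", (now - WINDOW,))  # drop entries outside the window
        count = DB.execute("SELECT COUNT(*) FROM hits WHERE ip = ?", (ip,)).fetchone()[0]
        if count >= LIMIT:
            start_response("429 Too Many Requests", [("Content-Type", "text/plain")])
            return [b"Download limit reached; try again later.\n"]
        DB.execute("INSERT INTO hits VALUES (?, ?)", (ip, now))
        DB.commit()
        data = open(PDF_PATH, "rb").read()   # fine for a sketch; stream in chunks for large files
        start_response("200 OK", [("Content-Type", "application/pdf"),
                                  ("Content-Length", str(len(data)))])
        return [data]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()

The same pattern translates directly to PHP or ASP.NET: look up the requesting IP, compare against a count stored in the database, and only then send the bytes.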
I want to implement HTTPS on only a selection of my web pages. I have purchased my SSL certificates and got them working. However, due to speed demands I cannot afford to serve every single page over HTTPS.
Instead I want my server to serve HTTP or HTTPS depending on the page being viewed. An example of a site where this has been done is 99designs.
The problem in slightly more detail:
When visitors first arrive at my site they only have access to non-sensitive information, so I want those pages served over plain HTTP.
Once they log in they are granted access to more sensitive information, e.g. profile pages, which should be delivered over HTTPS.
Even while logged in, if the user goes back to a non-sensitive page such as the homepage, I want it delivered over HTTP.
One common solution seems to be the .htaccess file. The problem is that my site is relatively large, so this would require a rule for every page (several hundred) to determine whether it should be served over HTTP or HTTPS.
And then there is the problem of handling user-generated content pages.
Please help,
Many thanks,
David
You've not mentioned anything about the architecture you are using. Assuming that SSL termination is on the web server, you should set up separate virtual hosts with completely separate, non-overlapping document trees, and preferably use a path scheme which does not overlap (to avoid little accidents).
I wonder how some video streaming sites can restrict videos to be played only on certain domains. More generally, how do some websites respond only to requests from certain domains?
I've looked at http://en.wikipedia.org/wiki/List_of_HTTP_header_fields and saw the Referer field, which might be what is used, but I understand that HTTP headers can be spoofed (can they?).
So my question is, can this be done at the application level? By application, I mean, for example, web applications deployed on a server, not a network router's operating system.
Any programming language would work for an answer. I'm just curious how this is done.
If anything's unclear, let me know. Or you can use it as an opportunity to teach me what I need to know to clearly specify the question.
HTTP headers carrying IP information are helpful (because only a small portion of them are faked), but they are not reliable. Web applications usually use web frameworks, which give you easy access to these headers.
Some ways to gain source information:
The originating IP address from the TCP/IP network stack itself. The problem is that this server-visible address need not match the real client's address (it could come from a company proxy, an anonymous proxy, a big ISP gateway, and so on).
The HTTP X-Forwarded-For header. Proxies are supposed to set this header to solve the problem above, but it can also be faked, and many anonymizing proxies don't set it at all.
Apart from IP-based source information you can also use machine identifiers (some use the User-Agent header). Several sites, for instance, store such machine identifiers inside Flash cookies so they can re-identify a returning client and block it. But it's the same story: this is unreliable and can be faked.
The underlying problem is that you need a lot of security machinery to identify a client reliably (e.g. authentication and client-side certificates). But that is high effort and adds a lot of usability problems, so many sites don't do it. Most often this isn't an issue, because only a small portion of clients put in the effort to fake their identity and access the server.
HTTP Referer is a different thing: it shows you which page the user came from, and it is set by the browser. It is also unreliable, because the value can be forged and some clients do not include it at all (I remember several IE versions skipping the Referer).
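For illustration, here is what such a check might look like inside a minimal Python WSGI application. The allowed referrer list is a placeholder, and, as said above, every one of these values can be forged by the client, so treat this as a nuisance filter rather than real security.

    # Sketch: reject requests whose Referer does not point at an allowed domain.
    ALLOWED_REFERERS = ("https://example.com/", "https://www.example.com/")  # placeholders

    def app(environ, start_response):
        referer = environ.get("HTTP_REFERER", "")
        # REMOTE_ADDR is whatever the TCP connection reports; X-Forwarded-For is
        # only meaningful if it was added by a proxy you control.
        client_ip = environ.get("HTTP_X_FORWARDED_FOR",
                                environ.get("REMOTE_ADDR", ""))
        if not referer.startswith(ALLOWED_REFERERS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"This resource may only be embedded from an allowed site.\n"]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [("Serving client %s\n" % client_ip).encode()]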
These types of controls are based on the originating IP address. From the IP address, the country can be determined. Finding the IP address requires access to low-level protocol information (e.g. from the socket).
The Referer header makes sense when you click a link from one site to another, but a typical HTTP request built with a programming library doesn't need to include it.
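To make that last point concrete, a scripted client can omit the Referer entirely or set it to anything it likes, which is why Referer-based restrictions only stop casual hotlinking. The URL below is a placeholder:

    # A client controls every header it sends, including a forged Referer.
    from urllib.request import Request, urlopen

    req = Request("https://example.com/video/stream",              # placeholder URL
                  headers={"Referer": "https://trusted-site.example/"})
    print(req.header_items())   # shows exactly what headers would be sent
    # urlopen(req) would issue the request with this forged Referer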