fail2ban forces me to ban Google because of /forward requests in my log - Googlebot

In my apache log, I have a lot of stuff like this:
<IP ADDRESS> - - <DATE> "GET /forward?path=http://vary_bad_link_not_for_children" <NUM1> <NUM2> "-" <String>
<NUM1>: 302 or 404
<NUM2>: 5XX, 6XX or 11XX
<String>:
"Mozilla/5.0 (compatible; AhrefsBot/5.1; +http://ahrefs.com/robot/)"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
"Mozilla/5.0 (compatible; Googlebot/2.1; +...a link)"
"Mozilla/5.0 (compatible; Exabot/3.0; +...a link)"
etc...
I have made a jail for fail2ban with this regex:
failregex = ^<HOST> .*"GET .*/forward\?path=
Everything is working fine, except that the IP addresses being banned (see <IP ADDRESS> in the log) belong to Google and other very well-known companies.
I really don't understand why this is happening; why should I ban Google and the other companies, and if I shouldn't, why should I accept all those inappropriate requests to my server?
I would like to clarify my questions, as they were poorly explained:
1. Why are Google IPs (and those of other well-known companies) making these kinds of "porn" requests?
2. Is there any meaning to "/forward?path=..."? Is it an Apache feature?
3. How can I handle this problem without stopping the "good" bots from indexing my sites?
Thanks in advance for any help!

You can tell robots not to visit parts of your site in your robots.txt.
Adding
User-agent: *
Disallow: /forward
to your robots.txt will keep all well-behaved bots from visiting any page whose path begins with /forward. They will continue to visit and index your other pages.
If you want to allow /forward?path=something_nice but not /forward?path=very_bad_link, you can do that:
User-agent: *
Disallow: /forward?path=a_specific_bad_link
Disallow: /forward?path=another_bad_link
Why are bots making these requests?
This may be entirely innocent. Perhaps someone has mistakenly linked to your site, perhaps the page used to exist and no longer does.
This may be due to a link on your own site that points to this URL. Check for that.
In the worst case, it might be people using you as an unwitting proxy. Make sure that the server does not serve anything when /forward is requested, and check the logs for anything else suspicious.
What if the requests continue?
It may take a while for the requests to stop. Robots do not re-fetch your robots.txt on every visit, so you will have to wait for them to pick up the change.
However, if the requests don't eventually stop, it means they are malicious bots spoofing the Googlebot user-agent. robots.txt only provides instructions to robots: well-behaved bots honour them, but nothing forces a malicious robot to stay away. At that point you need a solution like fail2ban.
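If you keep the fail2ban jail for the malicious traffic, one way to avoid banning crawlers you have verified as genuine is fail2ban's ignoreip option. Here is a sketch of a jail.local entry, assuming the filter from the question is saved as apache-forward; the jail name, log path, and the exempted range are illustrative assumptions, not values from the question:
[apache-forward]
enabled   = true
port      = http,https
filter    = apache-forward
logpath   = /var/log/apache2/access.log
maxretry  = 3
bantime   = 3600
# Only list ranges you have confirmed really belong to the crawler operator,
# e.g. a Google crawler range you have verified yourself.
ignoreip  = 66.249.64.0/19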

Related

Should I block this IP range?

There is a crawler on my site that does not identify itself as a robot in its user agent.
One of the IP addresses is:
131.161.8.197
All of the bots fall within the 131.161.*.* IP range.
Apparently it is a Brazilian Baidu crawler, based on an IP whois lookup.
Should I just go ahead and block that entire range of IPs?
So it's originating from Brazil; the real question is: do you need to target the Brazilian market?
Blocking the crawler will mean that you have less traffic to deal with, so I personally would say yes to blocking it.
You can either use your robots.txt or block it server-side. In Apache, you can use:
Order Deny,Allow
Deny from 131.161.8.197
or:
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /
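The Deny line above only blocks the single address. If you do decide to block the whole 131.161 range, Apache accepts a partial IP address; here is a sketch for both configuration styles (adjust to your Apache version):
# Apache 2.2 (mod_authz_host) - a partial IP address matches the whole range
Order Deny,Allow
Deny from 131.161

# Apache 2.4 equivalent
<RequireAll>
    Require all granted
    Require not ip 131.161
</RequireAll>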

Classic ASP: ServerVariables["HTTP_HOST"] forging

According to this MSDN article https://msdn.microsoft.com/en-us/library/ms525396(v=vs.90).aspx these variables are set based on headers. I'm curious if the HTTP_HOST variable is spoofable. I've run a few tests that indicate it's not spoofable, but I'd like to be sure.
EDIT: For clarity, I'm curious if something like a server proxy, man in the middle, or just someone who knows how to use netcat could forge the appropriate headers in order to manipulate HTTP_HOST within my scripts.
Yes, it is spoofable.
A malicious user of your website can set any Host header they want in their HTTP request, and the HTTP_HOST variable will reflect it:
GET / HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
The only caveat is that the host must be bound on your webserver. For example, if you use an IIS webserver, you specify which hosts your website is bound to. This can be blank for "any", or you can set it to a specific domain name. In the latter case, an attacker sending
GET / HTTP/1.1
Host: www.foo.com
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
to your example.com server would not hit your example.com website, but would instead hit the default IIS website, if that is running and has a blank binding. If it isn't running, an error is returned instead. These measures protect you from spoofing in attacks such as cache poisoning or spoofed malicious password-reset emails.
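To see how little effort this spoofing takes, here is a minimal sketch using Python's standard http.client; the IP address and host names below are placeholders, not values from the question:
import http.client

# Connect to the server by IP address, but claim to be asking for some other host.
# 203.0.113.10 and www.example.com are placeholder values.
conn = http.client.HTTPConnection("203.0.113.10", 80, timeout=5)
conn.request("GET", "/", headers={
    "Host": "www.example.com",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
})
resp = conn.getresponse()
print(resp.status, resp.reason)
# Whatever the server-side script reads from HTTP_HOST is simply this header value.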
Also note that for HTTPS sites the binding is not selected by the Host header alone. They either bind directly to a single IP, or they use Server Name Indication (SNI), which is sent as part of the TLS/SSL handshake to determine which website the request is for. IIS 8 and above support SNI, and this introduces a per-binding certificate that is interpreted in the same way as the host header for plain HTTP. Note that this can be spoofed in the same way by an attacker at the browser end, because they can send whichever domain name they want.
However, SNI information cannot be altered by a man-in-the-middle attacker like it can with a plain HTTP request. This is because the browser checks that the certificate matches the requested domain and warns the user if it does not. There is no such authentication with plain HTTP. The only attack I can think of in a MITM scenario with HTTPS is one where wildcard certificates are used, so that a MITM could make the user hit a different site than expected with no browser warning. However, the hash in the TLS handshake's FINISHED message would not verify if the SNI had been altered by a MITM, so that should mitigate this attack.

Does https encrypt the whole URL?

I googled a lot and many answers say yes. For example: Is GET data also encrypted in HTTPS? But the senior security engineer in our company told me the URL would not be encrypted.
Imagine this: if the URL were encrypted, how would the DNS server find the host and connect?
I think this is a very strong point, although it goes against most of the answers. So I'm really confused, and my questions are:
Does HTTPS encrypt everything in the request (including the URL, host, path, parameters, and headers)?
If yes, how does the DNS server decrypt the request and send it to the host server?
I tried to access https://www.amazon.com/gp/css/homepage.html/ref=ya_surl_youracct and my IE sent two requests to the server:
First:
CONNECT www.amazon.com:443 HTTP/1.0
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Host: www.amazon.com
Content-Length: 0
DNT: 1
Connection: Keep-Alive
Pragma: no-cache
Second:
GET /gp/css/homepage.html/ref=ya_surl_youracct HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US,zh-CN;q=0.5
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
Host: www.amazon.com
DNT: 1
Connection: Keep-Alive
It seems my browser sent two requests: the first to establish the connection with the host (without encryption), and the second to send an encrypted request over HTTPS. Am I right? If I'm understanding this correctly, when a client calls a RESTful API over HTTPS, does it send two requests (the CONNECT and the GET/POST) every time?
The URL IS encrypted from the time it leaves the browser until it hits the destination server.
What happens is that the browser extracts the domain name and the port from the URL and uses those to resolve DNS itself. Then it opens an encrypted channel to the destination server's IP:port and sends an HTTP request through that encrypted channel.
The important part is that anyone other than you and the destination server can only see that you're connecting to a specific IP address and port. They can't see anything else (specific URLs, GET parameters, and so on).
An eavesdropper can usually still learn the domain name, from the DNS lookup and from the Server Name Indication (SNI) field sent in clear during the TLS handshake, but not the path or the query string.
The big thing to understand is that DNS (Domain Name Service) is a completely different service with a different protocol from HTTP. The browser makes DNS lookup requests to convert a domain name into an IP address. Then it uses that IP address to issue a HTTP request.
But at no time does the DNS server receive an HTTP request, and at no time does it do anything other than provide a domain-name-to-IP mapping.
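The separation is easy to see with a short sketch using only Python's standard library; the hostname is just an example:
import socket, ssl

host = "www.example.com"

# 1. DNS: the resolver only ever sees the hostname, never the URL path.
ip = socket.gethostbyname(host)

# 2. TLS: connect to the resolved IP and establish an encrypted channel.
ctx = ssl.create_default_context()
with socket.create_connection((ip, 443)) as tcp:
    with ctx.wrap_socket(tcp, server_hostname=host) as tls:
        # 3. HTTP: the path, query string and headers travel only inside
        #    the encrypted channel; the network sees TLS records, not URLs.
        tls.sendall(b"GET /some/private/path?id=42 HTTP/1.1\r\n"
                    b"Host: www.example.com\r\n"
                    b"Connection: close\r\n\r\n")
        print(tls.recv(200))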
While the other responses are correct so far as they go, there are many other considerations than just the encryption between the browser and the server. Here are some things to think about...
The IP address of the server is resolved.
The browser makes a TCP socket connection to the server's IP address using TLS. This is the CONNECT you see in your example.
The request is sent to the server over the encrypted session.
If that were all there is to it, you would be done. No problem.
But wait, there's more!
Having the fields in a GET instead of a POST reveals sensitive data when...
Someone looks in the server logs. This might be a snoopy employee, but it can also be the NSA or other three-letter government agency, or the logs might become public record if subpoenaed in a trial.
An attacker causes the web site's encryption to fall back to cleartext or a broken cipher. Have a look at the SSL Server Test from Qualys SSL Labs to see whether a site is vulnerable to this.
Any link on the page to an external site will show the URI of the page as the referrer. User ID and passwords are unintentionally yet commonly given away in this fashion to advertising networks. I sometimes spot these in my own blog.
The URL is saved in the browser history. If the computer is public (someone checks your web site from the guest PC in a hotel or airport lounge), the GET request leaks data to anyone else using that device.
As I mentioned, I sometimes find IDs, passwords and other sensitive info in the referrer logs of my blogs. In my case, I contact the owner of the referring site and tell them they are exposing their users to hacking. A less scrupulous person would add comments or updates to the site with links to their own web site, with the intention of harvesting the sensitive data in their referrer logs.
So your company's senior security engineer is correct that the URL is not encrypted in many places where it is extremely important to do so. You and the other respondents are also correct that it is encrypted in the very narrow use case of the browser talking to the server in context of a TLS session. Perhaps the confusion you mention has to do with the difference in the scope of these two use cases.
Please see also:
Testing for Exposed Session Variables (OTG-SESS-004)
Session Management - How to protect yourself (Note that "always use POST" is repeated over and over on this page.)
Client account hijacking through abusing session fixation on the provider
The URL (also known as "Uniform Resource Locator") contains four parts:
Protocol (e.g. https)
Host name (e.g. stackoverflow.com)
Port (not always included, typically 80 for http and 443 for https)
Path and file name or query
Some examples:
ftp://www.ftp.org/docs/test.txt
mailto:user@test101.com
news:soc.culture.Singapore
telnet://www.test101.com/
The URL as an entire unit is not actually encrypted because it is not passed in its entirety. The URL is actually pulled apart into bits and each part is used in different ways. E.g. the protocol portion will tell your browser how to use the rest of the URL, the host name will tell it how to look up the IP address of the intended recipient, and the port will tell it, well, which port to use. The only portion of the URL that is passed in the payload itself is the path and query, and that portion is encrypted.
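To make the split concrete, here is a small sketch using Python's urllib.parse; the URL is just an example:
from urllib.parse import urlsplit

parts = urlsplit("https://www.test101.com:443/docs/index.html?q=secret")
print(parts.scheme)    # 'https'            -> tells the browser which protocol to use
print(parts.hostname)  # 'www.test101.com'  -> used for the DNS lookup (and SNI)
print(parts.port)      # 443                -> which TCP port to connect to
print(parts.path)      # '/docs/index.html' -> sent only inside the encrypted request
print(parts.query)     # 'q=secret'         -> also sent only inside the encrypted request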
If you take a look at an HTTP request in the raw, it looks something like this:
GET /docs/index.html HTTP/1.1
Host: www.test101.com
Accept: image/gif, image/jpeg, */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
(blank line)
--Body goes here--
What you see in the example above is what gets passed. Notice that the full URL appears nowhere. The Host header carries only the domain name, not the full URL (under HTTP/1.1 it is required so the server can pick the right virtual host, and over HTTPS it travels inside the encrypted channel). The only portion of the URL that appears here is to the right of the GET verb, and it contains only the path and query from the original URL. The protocol and the port number appear nowhere in the message itself.
Short answer: Everything to the right of the port number in the URL is included in the payload of the https request and is in fact encrypted.

Security concerns using robots.txt

I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions are to include these in the robots.txt file and place it into the root of my domain.
But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see its contents:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
which tells anyone exactly which pages I don't want them to visit.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want exposed to the public: the PayPal validation page for my software license payments. The page logic will not let a bogus request through, but it wastes bandwidth (for the PayPal connection as well as for validation on my server), and it logs a connection-attempt entry in the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in the .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something a crawler could simply guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
Secure access to the functionality behind those URLs
Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but list only hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
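If you want to act on that honeypot automatically, here is a rough Python sketch that scans an Apache access log for hits on the trap path; the log location, the combined log format, and the /honeypot.php path are assumptions for illustration:
from collections import Counter

hits = Counter()
with open("/var/log/apache2/access.log") as log:
    for line in log:
        if "GET /honeypot.php" in line:
            ip = line.split()[0]   # client IP is the first field of the combined log format
            hits[ip] += 1

# IPs requesting the honeypot have read robots.txt and ignored it on purpose.
for ip, count in hits.most_common():
    print(ip, count)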
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
This is to supplement David Underwood's and unor's answers (not having enough rep points, I am left with just answering the question). Recent digging shows that Google has a clause that allows it to ignore the previously respected robots.txt file, on top of other security concerns. The link is a blog post from Zac Gery explaining the new(er) policy and giving some simple explanations of how to "force" the Google search engine to be nice. I realize this isn't precisely what you are looking for, but on the QA and security side I have found it very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html

Is this Googlebot or someone trying to impersonate Googlebot?

In my ELMAH exception logs I keep getting exceptions from what appears to be Googlebot, but which I imagine is someone impersonating it, trying to download what appears to be warez and other dodgy software from my server.
Here are just a few of the attempts and the software they are trying to get.
The controller for path '/download/msjavx86.exe' was not found
/downloads/IEZawGyiGtalkfont.EXE
/downloads/alphazawgyiremover.exe
/downloads/gtalkmyanmaraddinremover.exe
/cgi-bin/irbis32r/cgiirbis_32.exe
/ticker/MBISetup.exe
The user agent and remote host are always the same:
REMOTE_HOST 66.249.65.163
HTTP_USER_AGENT Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
So my question is: is this Googlebot scanning for malware, or is someone having a go at my server?
I would guess yes, it is Googlebot. Google does scan websites for its safe-browsing listings, and a malware scan based on your server software is part of that.
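One way to check is the reverse-then-forward DNS verification that Google recommends for Googlebot. Here is a small sketch using only Python's standard library, with the IP address taken from the question:
import socket

ip = "66.249.65.163"

# Reverse lookup: a genuine Googlebot IP resolves to a googlebot.com or google.com name.
hostname, _, _ = socket.gethostbyaddr(ip)
print(hostname)  # e.g. something like crawl-66-249-65-163.googlebot.com

# Forward lookup: the name must resolve back to the original IP.
forward_ips = socket.gethostbyname_ex(hostname)[2]
is_google = (hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")) \
            and ip in forward_ips
print(is_google)
If both checks pass, the requests really do come from Google; if not, it is an impersonator and you can safely block it.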
