Should I block this IP range? - .htaccess

There is a crawler on my site that does not identify as a robot in its user agent.
One of the IP addresses is:
131.161.8.197
All of the bots fall within the 131.161 IP range (131.161.0.0/16).
Apparently it is a "brasil baidu", based on an IP whois lookup.
Should I just go ahead and block that entire range of IPs?

So it's originating from Brazil; the real question is: do you need to target the Brazilian market?
Blocking the crawler means you have less traffic to deal with, so I personally would say yes, block it.
You can either use your robots.txt or block it server-side. In .htaccess you can use:
Order Deny,Allow
Deny from 131.161.8.197
or:
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /
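Bear in mind that the robots.txt rules above only help if the crawler actually identifies itself as Baiduspider and honours them; since yours does not announce itself as a bot, the server-side deny is the more reliable route. As a minimal sketch for blocking the whole range rather than a single address (using the same Apache 2.2-style directives as above, and assuming "all of the bots are in 131.161" really does mean the whole /16):
Order Deny,Allow
Deny from 131.161.0.0/16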

Related

fail2ban forces me to ban Google because of /forward in my log

In my apache log, I have a lot of stuff like this:
<IP ADDRESS> - - <DATE> "GET /forward?path=http://vary_bad_link_not_for_children" <NUM1> <NUM2> "-" <String>
<NUM1>: 302 or 404
<NUM2>: 5XX, 6XX or 11XX
<String>:
"Mozilla/5.0 (compatible; AhrefsBot/5.1; +http://ahrefs.com/robot/)"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
"Mozilla/5.0 (compatible; Googlebot/2.1; +...a link)"
"Mozilla/5.0 (compatible; Exabot/3.0; +...a link)"
etc...
I have made a jail for fail2ban with this regex:
failregex = ^<HOST> .*"GET .*/forward\?path=
Everything is working fine, except that the IP addresses that get banned (see <IP ADDRESS> in the log) are the IPs of Google and other very well-known companies.
I really don't understand why this is happening; I mean, why should I have to ban Google and the other companies? And if I don't, why should I accept all those inappropriate requests to my server?
I would like to clarify my questions, as they were poorly explained:
1. Why are Google IPs (and those of other known companies) making these kinds of "porn" requests?
2. Is there any meaning to "/forward?path=..."? Is it an Apache feature?
3. How can I handle this problem without stopping the "good" bots from referencing my sites?
Thanks in advance for any help!
You can tell robots not to visit parts of your site in your robots.txt.
Adding
User-agent: *
Disallow: /forward
to your robots.txt will keep all well-behaved bots away from pages whose paths begin with /forward. They will continue to visit and index your other pages.
If you want to allow /forward?path=something_nice but not /forward?path=very_bad_link, you can do that:
User-agent: *
Disallow: /forward?path=a_specific_bad_link
Disallow: /forward?path=another_bad_link
Why are bots making these requests?
This may be entirely innocent. Perhaps someone has mistakenly linked to your site, or perhaps the page used to exist and no longer does.
This may be due to a link on your own site that points to this URL. Check for that.
In the worst case, it might be people using you as an unwitting proxy. Make sure that the server does not serve anything when /forward is requested, and check the logs for anything else suspicious.
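As a minimal sketch of "serve nothing for /forward" (assuming Apache with mod_alias available, and that nothing legitimate on your site lives under /forward), you could answer those requests with a cheap 410 Gone straight from .htaccess:
# Reject anything under /forward without doing any further work
RedirectMatch 410 ^/forward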
What if the requests continue?
It may take a while for the requests to stop. Robots do not fetch your robots.txt before every request, so you will have to wait for them to pick up the change.
However, if they don't eventually stop, it means they are malicious bots spoofing the Googlebot user agent. robots.txt only provides instructions to robots: well-behaved bots honour them, but it can't force a malicious robot to stay away. At that point you need a solution like fail2ban.
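If you do end up relying on fail2ban, you can reduce the risk of banning the real Googlebot by whitelisting crawler ranges you have verified. A rough sketch for /etc/fail2ban/jail.local (the jail and filter names are hypothetical, and the ignoreip range is only an illustrative published Googlebot range; verify against the search engines' current lists):
[apache-forward]
enabled  = true
port     = http,https
filter   = apache-forward
logpath  = /var/log/apache2/access.log
maxretry = 1
# skip addresses you have confirmed belong to legitimate crawlers
ignoreip = 66.249.64.0/19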

Blocking a City or Region from website

I know it is easy to block an individual IP address or a whole country from viewing my website via .htaccess; however, I need to block only a single UK city and have visitors from that city redirected to another, external URL.
Here is some code I already have for my .htaccess file, but I have been searching everywhere for how to block just a UK city or region. Where would I find the IP ranges for a specific UK city, or is there a better way of doing this?
# BAN USER BY IP
<Limit GET POST>
order allow,deny
allow from all
deny from (an individual IP address or range)
</Limit>
ErrorDocument 403 http://www.google.com
You can use an IP-to-location script (http://www.ip2location.com/) to look up the city or region a visitor's address belongs to, and then block or redirect based on that.
There are several ways to do this. First, you can use any geoIP API to query the location of the visitor. Just google "geoIP API" to see what's available; there are online services as well as downloadable databases.
You can also "wire" the banned IP address blocks into your web app. You can query block information at http://location2ipaddress.com/, for example. If you want to use the .htaccess file for this, all you need is the IP range data: put it into the deny list and you're done.
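As a sketch of that .htaccess route combined with a geoIP module (assuming MaxMind's legacy mod_geoip plus mod_rewrite; the city name and redirect target are placeholders, and the GeoIP directives may need to live in the server or vhost config rather than .htaccess depending on your setup):
# GeoIP setup with a city-level database
GeoIPEnable On
GeoIPDBFile /usr/share/GeoIP/GeoLiteCity.dat
# Redirect visitors whose detected city matches to an external URL
RewriteEngine On
RewriteCond %{ENV:GEOIP_CITY} ^SomeCity$ [NC]
RewriteRule ^ http://www.example.com/ [R=302,L]
This sends matching visitors to the external URL rather than just returning a 403, which is what the question asks for.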
Whatever you do, blocking is a delicate topic.
It is easy to get around the block by using proxy servers.
Blocking whole ranges is risky: you may be preventing innocent users from accessing your website. There is no harmless, 100% safe way to do this.

Security concerns using robots.txt

I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The usual instructions are to list these in the robots.txt file and place it in the root of my domain.
But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see its contents:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
which tells anyone exactly which pages I don't want them to visit.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want exposed to the public: the PayPal validation page for my software license payments. The page logic will not let a bogus request through, but it wastes bandwidth (for the PayPal connection as well as for validation on my server), and it logs a connection-attempt entry in the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in the .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something a crawler could simply guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
Secure access to the functionality behind those URLs
Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest would be some kind of session token given to authorized users (see the sketch after this answer for one basic variant).
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
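As one minimal sketch of number 1, in .htaccess terms rather than a session token (HTTP Basic authentication via mod_auth_basic; the paths are placeholders and this only suits pages meant for human visitors, not automated callbacks):
AuthType Basic
AuthName "Private area"
AuthUserFile /path/to/.htpasswd
Require valid-user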
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but just put hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
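For example, once the pages are moved under a directory called private, the entire robots.txt entry can be as uninformative as:
User-agent: *
Disallow: /private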
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
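If you want to act on those hits automatically, a rough sketch pairing the honeypot entry with a fail2ban filter (the filter file name is hypothetical; adjust the regex to your own log format):
# robots.txt
User-agent: *
Disallow: /honeypot.php
# /etc/fail2ban/filter.d/robots-honeypot.conf
[Definition]
failregex = ^<HOST> .*"GET /honeypot\.php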
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
This is to supplement David Underwood's and unor's answers (not having enough rep points, I am left with just answering the question). Recent digging shows that Google has a clause that allows it to ignore the previously respected robots file, on top of other security concerns. The link is a blog post from Zac Gery explaining the new(er) policy, with some simple explanations of how to "force" the Google search engine to be nice. I realize this isn't precisely what you are looking for, but on the QA and security side I have found it very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html

Will Google be able to access my website after blocking all US IPs?

I'm going to block all US IPs using .htaccess this way:
<Limit GET HEAD POST>
order deny,allow
deny from 3.0.0.0/8
deny from 4.0.0.0/25
deny from 4.0.0.128/26
deny from 4.0.0.192/28
deny from 4.0.0.208/29
....
allow from all
</Limit>
Will Google be able to access and index my website after blocking all US IPs?
EDIT : Sorry for the ambiguity, but I DO want Google to index my website.
Although Google has its servers spread across the whole world, it would be quite hard to say where the search engine's bots mostly originate from. What I suggest would be to block the IP ranges but add an exclusion clause that matches against the User-Agent for search bots like:
SetEnvIfNoCase User-Agent (googlebot|bingbot|yahoo!\sslurp) is_search_bot
<Directory /docroot>
Order Deny,Allow
Deny from 3.0.0.0/8
Deny from 4.0.0.0/25
Deny from 4.0.0.128/26
Deny from 4.0.0.192/28
Deny from 4.0.0.208/29
Allow from env=is_search_bot
</Directory>
I don't think so, but if you really don't want Google to index it, then use a robots.txt file so it doesn't. The robots.txt would be:
User-agent: googlebot
Disallow: /directory/
If it's just a matter of blocking US IPs and nothing more, then you're probably fine, as Google has data centers in many different locations, not just the United States. This means that Google will probably still index it.
Although Google has many data centers, its crawler bots operate from US IPs, so no, Google will not be able to scan your website if you block US IPs.
If you can't access your domain's root directory, just use this meta tag to stop Googlebot from indexing specific page(s):
<meta name="googlebot" content="noindex">
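The reverse also works: if you can edit .htaccess but not the pages themselves, you can send the equivalent X-Robots-Tag response header (assuming mod_headers is enabled; the file name is a placeholder):
<Files "private-page.php">
Header set X-Robots-Tag "noindex"
</Files>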
If your site has already been indexed by Google's crawler, follow the guide "Remove your own content from Google search results".
Access: https://www.google.com/webmasters/
All the information you need is there.
Here Google explains how you can block Googlebot from indexing your site:
https://support.google.com/webmasters/answer/93708
About your question: I think that if you block all US IP addresses, Google's crawlers in other countries can still access and index your site, and they will then sync with Google US.

.htaccess deny all by IP address except those in the United States?

I have a local website that I would like to restrict to visitors within the United States only, or perhaps only within Florida. It's a WordPress site that has been hacked due to some weak code. I've seen two sources of IP address lists for .htaccess allow/deny control by IP address.
IP by Country/Continents:
http://www.countryipblocks.net/continents/
Wizcrafts List:
http://www.wizcrafts.net/htaccess-blocklists.html
What is the best approach for blocking everything except United States traffic? How would you approach the deny/allow? Would you deny other countries or try to allow only the U.S.?
Thanks for any comments, Jeff
Add this list to the .htaccess file located in the root folder of your server.
It will only allow connections from the US.
Example .htaccess file:
order deny,allow
deny from all
allow from 203.31.234.0/24
allow from 129.230.176.0/20
etc...
The deny from all line forbids access to your site by default; from countryipblocks you can download all the IP ranges for the area you want and add an allow from line for each one to your .htaccess file, so only those IPs can access your site.
Edit: Remember that you can use an IP range instead of a single IP!
I downloaded the .htaccess from that site, and it worked fine!
