Ban robots from website [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 8 years ago.
My website is often down because a spider is accessing too many resources. This is what the hosting company told me. They told me to ban these IP addresses:
46.229.164.98
46.229.164.100
46.229.164.101
But I have no idea how to do this.
I've googled a bit, and I've now added these lines to the .htaccess file in the root:
# allow all except those indicated here
<Files *>
order allow,deny
allow from all
deny from 46.229.164.98
deny from 46.229.164.100
deny from 46.229.164.101
</Files>
Is this 100% correct? What else could I do?
Please help me; I really have no idea what I should do.

Based on these:
https://www.projecthoneypot.org/ip_46.229.164.98
https://www.projecthoneypot.org/ip_46.229.164.100
https://www.projecthoneypot.org/ip_46.229.164.101
it looks like the bot is http://www.semrush.com/bot.html.
If that's actually the robot, their page says:
To remove our bot from crawling your site simply insert the following lines to your
"robots.txt" file:
User-agent: SemrushBot
Disallow: /
Of course, that does not guarantee that the bot will obey the rules. You can block it in several ways; .htaccess is one, just like you did.
You can also use this little trick: deny ANY IP address that has "SemrushBot" in its User-Agent string:
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
SetEnvIfNoCase User-Agent "^SemrushBot" bad_user
SetEnvIfNoCase User-Agent "^WhateverElseBadUserAgentHere" bad_user
Deny from env=bad_user
This way you will also block other IPs that the bot may use.
See more on blocking by user-agent string: https://stackoverflow.com/a/7372572/953684
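For completeness, a combined .htaccess along those lines might look roughly like this (a sketch only, not the original poster's file; adjust the IPs and user-agent strings to whatever actually shows up in your logs):
# flag any request whose User-Agent contains SemrushBot (case-insensitive)
SetEnvIfNoCase User-Agent "SemrushBot" bad_user
Order Allow,Deny
Allow from all
# the IPs reported by the host, plus anything flagged above
Deny from 46.229.164.98
Deny from 46.229.164.100
Deny from 46.229.164.101
Deny from env=bad_user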
I should add that if your site is brought down by a spider, it usually means you have a badly written script or a very weak server.
Edit:
this line
SetEnvIfNoCase User-Agent "^SemrushBot" bad_user
tries to match only if the User-Agent begins with the string SemrushBot (the caret ^ means "beginning with"). If you want to match, say, SemrushBot ANYWHERE in the User-Agent string, simply remove the caret so it becomes:
SetEnvIfNoCase User-Agent "SemrushBot" bad_user
The above matches if the User-Agent contains the string SemrushBot anywhere (yes, there is no need for .*).

You are doing the right thing, BUT
you have to write that code in the .htaccess file, not in the robots.txt file.
For disallowing a search engine from crawling your site via robots.txt, the code should look like this:
User-agent: Googlebot
Disallow: /
This will disallow Google from crawling your site.
I would prefer the .htaccess method, by the way.
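As a side note, newer Apache versions (2.4 and later) replace Order/Allow/Deny with mod_authz_core's Require directives. Assuming your host runs 2.4 (an assumption, not something stated in the question), a rough equivalent of the IP block would be:
<RequireAll>
    # allow everyone except the three crawler IPs reported by the host
    Require all granted
    Require not ip 46.229.164.98
    Require not ip 46.229.164.100
    Require not ip 46.229.164.101
</RequireAll>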

Related

What is this file in .htaccess?

I am really wondering why my .htaccess has the code below; can anyone tell me what this code is?
<Files 403.shtml>
order allow, deny
allow from all
</Files>
deny from 212.92.53.18
It is definitely not malware.
At least, not in the sense that it was put there for malicious reasons...
If you are using cPanel and have used its IP Deny Manager to block access from 212.92.53.18, then this will automatically be written to your .htaccess file, with the intended purpose of blocking that IP (and any others you may wish to enter):
<Files 403.shtml>
order allow, deny
allow from all
</Files>
deny from 212.92.53.18
Do you use cPanel, and if so, do you remember doing that?
Allowing the 403 page to all simply prevents a loop. If you block an IP using the 'deny from' method, then serving the 403 page to that IP would also be blocked, creating a loop. Allowing the specific 403 file to ALL overrides the block on serving the 403 to that specific IP that would otherwise occur. That prevents a loop.
<Files 403.shtml>
order allow, deny
allow from all
</Files>
I used it myself on an old domain. It simply says "allow anyone to access the file named 403.shtml", which is the forbidden-access error page. Of course, you would usually use this if you created a custom 403.shtml page.
The denied IP in this case would not see the custom 403.shtml and instead would get a White-screen-of-death.
So this is not, in any way shape or form, malware related.
UPDATE: This answer was based on speculation using the facts provided when it was originally posted. The overall consensus seems to be this modification of the .htaccess file is most likely the result of using server management software such as CPanel so it’s not—on its own—an indication of malware infection.
The contents of that .htaccess are a bit odd.
<Files 403.shtml>
order allow, deny
allow from all
</Files>
deny from 212.92.53.18
The <Files 403.shtml> part refers to the 403.shtml file, and it seems to be allowing a custom 403: Forbidden .shtml page (an assumption based on the file naming) to be sent. The order allow, deny and the related allow from all explain it to me: it seems like the site is blocking all traffic in some way but wants that 403.shtml to come through?
But the deny from 212.92.53.18 is quite specific & odd as a result. That is basically blocking any/all access from 212.92.53.18.
Now, typing that out, it seems like the .htaccess is set to explicitly deny access from the address 212.92.53.18, which would send a 403 response code, and the <Files 403.shtml> allows the actual 403: Forbidden error page to be sent?
But still, it seems odd that a directive blocking traffic from one single IP address would be in an .htaccess file like that.
EDIT: Did a Google search for <Files 403.shtml>—because if you know Apache configs, that is a highly odd directive—and it seems like this might be part of some malware? Look at this page as well as this page and this other page.
Seems like this is part of a definite XSS backdoor? Perhaps the .htaccess is in a malware directory, and the deny from 212.92.53.18 is denying the infected server from accessing itself?
ANOTHER EDIT: Okay, putting on my thinking cap—as well as personal experience with web malware—and looking at the specificity of the deny from 212.92.53.18 I think I know what the deal is. This is part of a malware infection. But I bet that 212.92.53.18 is a node on a botnet, because you can curl -I it & visit it in a browser & it seems to be an active server. Most client IP addresses just won’t do that; who has a web server exposed on a basic ISP connection, right? Unless the machine is infected. So the 403.shtml is not actually a real 403: Forbidden page but actually part of the malware. Meaning, a connection being made FROM 212.92.53.18 would trigger 403.shtml—which is a server side include HTML file—that could be used for unauthorized access. I mean, when has anyone in 2014 last seen active .shtml files on legit servers, right? It’s all PHP, Python, Java or Ruby nowadays.
This?
<Files 403.shtml>
order allow,deny
allow from all
</Files>
deny from xx.xx.xx.xx
Hacker? Backdoor? Malware? Ukrainian DoS attack?
Of course it IS NOT. It's nothing of the sort.
It is automatically generated by cPanel, when the "IP Blocker" is used.
cPanel writes it to your .htaccess file
The 'deny from' is simply the IP specified when using the cPanel IP Blocker tool. cPanel is clever enough to know a little more is needed than just a simple 'deny' IPv4 entry.
It is probably a terrifying hack and malware. Ukrainian/Russian/Indonesian hackers. In July 2016 they attacked a lot of PrestaShop sites through a vulnerability in image file uploads. They upload that 403.shtml to the root and then they destroy the server and files. I have checked that my site is listed on their web page that reports hacked websites. Some nights they block access to your site with a DDoS attack to get the MySQL and FTP passwords. In PrestaShop you have to upgrade urgently to 1.6.1.16 or upload some protection files. Unfortunately, I have done that, but they don't stop and keep trying to block my webshop.
The only other option is that you put an IP block in cPanel, but the trick is what Giacomo1968 says in their answer. Congratulations.

Is BLEXBot crawler used by Google? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I have set up my .htaccess this way:
SetEnvIfNoCase User-Agent .*google.* search_robot
SetEnvIfNoCase User-Agent .*yahoo.* search_robot
SetEnvIfNoCase User-Agent .*bot.* search_robot
SetEnvIfNoCase User-Agent .*ask.* search_robot
Order Deny,Allow
Deny from All
Allow from env=search_robot
I have this bot showing up:
IPv4 address: 198.143.187.122
Reverse DNS: blexn3.webmeup.com
RIR: ARIN
Country: United States
RBL Status: Clear
Threat: No threats detected
Is this bot used by Google, or am I missing something?
No, BLEXBot is not Google. It belongs to a company called WebMeUp. You can find information about them here.
If you look up the IP from your log you will see it's not Google.
IP Address 198.143.187.122
Host blexn3.webmeup.com
Location US US, United States
City Chicago, IL 60661
Organization SingleHop
ISP SingleHop
Google IPs will list Google as the organisation.
Google uses its own bots; they are custom built. You can read up about them here, including a definitive list of their user-agent strings, which may be useful to you.
To block it, follow the instructions here.
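If you do decide to block BLEXBot, note that it matches your generic .*bot.* rule, so your current config actually whitelists it. A small tweak along these lines should un-flag it (a sketch building on the rules from the question, not tested against your setup):
SetEnvIfNoCase User-Agent .*google.* search_robot
SetEnvIfNoCase User-Agent .*yahoo.* search_robot
SetEnvIfNoCase User-Agent .*bot.* search_robot
SetEnvIfNoCase User-Agent .*ask.* search_robot
# BLEXBot matches the generic .*bot.* rule above, so explicitly remove the flag for it
SetEnvIfNoCase User-Agent "BLEXBot" !search_robot
Order Deny,Allow
Deny from All
Allow from env=search_robot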

Will Google be able to access my website after blocking all US IPs?

I'm going to block all US IPs using .htaccess this way:
<Limit GET HEAD POST>
order deny,allow
deny from 3.0.0.0/8
deny from 4.0.0.0/25
deny from 4.0.0.128/26
deny from 4.0.0.192/28
deny from 4.0.0.208/29
....
allow from all
</Limit>
Will Google be able to access and index my website after blocking all US IPs?
EDIT : Sorry for the ambiguity, but I DO want Google to index my website.
Although Google has its servers spread across the whole world, it would be quite hard to say where the search engine's bots mostly originate from. What I suggest would be to block the IP ranges but add an exclusion clause that matches against the User-Agent for search bots like:
SetEnvIfNoCase User-Agent (googlebot|bingbot|yahoo!\sslurp) is_search_bot
<Directory /docroot>
Order Deny,Allow
Deny from 3.0.0.0/8
Deny from 4.0.0.0/25
Deny from 4.0.0.128/26
Deny from 4.0.0.192/28
Deny from 4.0.0.208/29
Allow from env=is_search_bot
</Directory>
I don't think so, but if you really don't want Google to index it, then use a robots.txt file so it doesn't index it. The robots.txt would be:
User-agent: googlebot
Disallow: /directory/
If it's just a matter of blocking US IPs and that's it, then you're probably good, as Google has data centers in many different locations, not just the United States. This means that Google will still probably index it.
Although Google has many data centers, all their bots are in the US, so no, Google will not be able to scan your website if you block US IPs.
If you can't access your domain's root directory, just use this meta tag to block Googlebot from indexing specific page(s):
<meta name="googlebot" content="noindex">
If your site was already indexed by the Google crawler, follow the guide Remove your own content from Google search results.
Access: https://www.google.com/webmasters/
There is all the information that you need.
Here, Google explains how you can block Googlebot from indexing your site:
https://support.google.com/webmasters/answer/93708
About your question, I think that if you block all US IP addresses, Google's crawlers from other countries should still be able to access and index your site, and then they should sync with Google US.

Preventing access to file in .htaccess results in error on request of any page from server? [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 9 years ago.
I have a logging folder where I log events. I want to include it in the admin panel or be able to search in it, but I don't want it to be accessible outside of localhost.
So I tried:
<Files /var/chroot/home/content/59/10667659/html/mylogs/log.html>
Order Deny,Allow
Deny from all
Allow from 127.0.0.1
</Files>
But this results in a 500-something error on all of my pages?
/var/chroot/home/content/59/10667659/html is $_SERVER['DOCUMENT_ROOT'];
The <Files> container does not work that way. From the Apache core documentation:
The <Files> directive limits the scope of the enclosed directives by filename. It is comparable to the <Directory> and <Location> directives. It should be matched with a </Files> directive. The directives given within this section will be applied to any object with a basename (last component of filename) matching the specified filename.
So you can say:
<Files "log.html">
Order Deny,Allow
Deny from all
Allow from 127.0.0.1
</Files>
Or if you want it to apply only in a specific directory, like /mylogs/, then create the .htaccess file in there and add the above <Files> container.
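For example, assuming the logs really do live under /mylogs/ as in the question, a minimal .htaccess placed inside that directory could restrict the whole folder rather than a single file:
# contents of /mylogs/.htaccess - only localhost may read anything in this folder
Order Deny,Allow
Deny from all
Allow from 127.0.0.1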

How could I redirect or deny users from a particular country with my htaccess file?

I looked at countryipblocks.net, and need to clarify...
If I want to block users from, say, Andorra from visiting my site, what exactly needs to be added to my (already existing) .htaccess file?
Do I need to simply add this block of text to my .htaccess?
<Limit GET HEAD POST>
order allow,deny
deny from 85.94.160.0/19
deny from 91.187.64.0/19
deny from 194.117.123.178/32
deny from 194.158.64.0/19
deny from 195.112.181.196/32
deny from 195.112.181.247/32
allow from all
</Limit>
On the other hand, if I want to redirect users from, say, Croatia, from http://mywebsite.com to http://google.com or a landing page, what exactly needs to be added to my .htaccess file?
Finally - how would "deny" appear to the user being denied access?
Thanks.
Visitors who are within an IP range that is banned by deny will be served a 403 error. If you want them to see a nice page instead of the standard Apache error, then you will need something like
ErrorDocument 403 /errors/403.html
in your .htaccess file. It is fairly easy to check that rules based on IP addresses are working in your .htaccess by setting the blocked IP to 127.0.0.1 (i.e. localhost); when you then look at the page in question on localhost, you should see the result of the page being blocked.
In answer to your question about redirecting users, blocking all users from any one country seems a little bit of overkill; however, try reading up on the RewriteCond directive.
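For illustration only, a mod_rewrite sketch that redirects matching addresses to another site could look like this (192.0.2.0/24 is a documentation placeholder, not a real Croatian range; substitute the CIDR blocks from countryipblocks.net):
RewriteEngine On
# placeholder range - replace with the country's real address blocks
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule ^ http://google.com/ [R=302,L]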
