How to check if a site is accessible to search bots

Are there any tests that I can run to see if search bots can access my site?

Spec: http://www.robotstxt.org/
Syntax check: http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php
Accessibility check: http://phpweby.com/services/robots
All of these appear to be Google-friendly.
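Beyond the online checkers, the same test can be sketched with Python's standard library. A minimal offline example, assuming a made-up robots.txt body (in practice you would fetch yoursite.com/robots.txt first):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body; in practice, fetch it from
# https://yoursite.com/robots.txt (e.g. rp.set_url(...) then rp.read()).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask whether a given bot may fetch a given URL under these rules.
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))   # False
```

This only checks the robots.txt rules; it does not verify that the page itself returns 200 to a crawler.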

Related

Google Not Indexing Site - Says 'Blocked by Robots.txt' - However Robots.txt allows all crawlers -- Same problem with two different hosting services

I have built and published quite a few websites and never had the following issue:
Google is not indexing my website. Whenever I submit the page (in Google Search Console) it says "blocked by robots.txt" although the robots.txt allows every crawler (User-agent: * and Allow: /). The robots.txt is accessible via mydomain.com/robots.txt and the site's sitemap is accessible via mydomain.com/sitemap.
I have tried it with two different hosting providers: Dreamhost.com and Fastcomet.com. The issue persists, however, and I cannot see why. The domains are registered with Namecheap.com, which I have used for many other sites for years.
I use Grav CMS -- a terrific flat-file CMS -- which usually works flawlessly and I don't think that the CMS causes the problem.
Below is a screenshot of Google's error message inside Google Search Console. Obviously, the robots.txt cannot be the culprit, since crawlers are allowed access.
Lastly, not even the domain is coming up in Google's search results. Usually, Google still displays a domain (without the accompanying description etc.) even when it is not allowed to crawl it.
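For reference, an allow-all robots.txt like the one the poster describes would look roughly like this (the sitemap URL mirrors the question; adjust to the real domain):

```text
User-agent: *
Allow: /

Sitemap: https://mydomain.com/sitemap
```

If Search Console still reports "blocked by robots.txt" with a file like this, it is worth checking that the host is not serving a different robots.txt (e.g. a cached or staging copy) to Googlebot than to a browser.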

How to set up a SharePoint link for a Google Chrome extension

So, I am trying to implement a SharePoint intranet site for an organization. However, there is one application in particular that they would like linked from the homepage. Unfortunately, this application can only be used via the IE Tab Google Chrome extension (I know, dumb), and the app's developers have yet to add Chromium compatibility.
Anyway, the link looks like this:
chrome-extension://hehijbfgiekmjfkfjpbkbammjbdenadd/nhc.htm#url=https://website.com/sub/sub.Hub.aspx
But SharePoint requires https:// at the beginning of a link.
If you paste that destination into Chrome directly, it navigates fine, but if you prepend, say, https://google.com/ or https://*/, it doesn't work.
Is there a syntax that will let me put https:// on the front of this without getting a 404 error?
Never mind, I ended up redirecting this through IIS internally.
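For anyone in the same spot, the IIS approach can be sketched in a web.config like the following. This is a rough sketch, not a tested configuration: the extension ID and target URL are the ones from the question, and whether IIS accepts a chrome-extension: destination may depend on the version and settings.

```xml
<!-- Hypothetical web.config for a small IIS site/app whose only job is to
     redirect every request to the chrome-extension:// URL. SharePoint then
     links to the plain https:// address of this IIS site. -->
<configuration>
  <system.webServer>
    <httpRedirect enabled="true"
                  destination="chrome-extension://hehijbfgiekmjfkfjpbkbammjbdenadd/nhc.htm#url=https://website.com/sub/sub.Hub.aspx"
                  httpResponseStatus="Found" />
  </system.webServer>
</configuration>
```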

Has anyone experienced Cloudflare 403 errors with Zombie.js web scraping?

We're looking to do some scraping of a specific URL that sits behind Cloudflare. Has anyone run into issues using Zombie.js or custom user agents while trying to crawl Cloudflare-hosted sites?
Would love some help!
I am trying to interface with an API on a client's site and I am indeed getting a 403 error; the request doesn't even reach the server.
Turning security down to "essentially off" did not help. The final solution was to whitelist the developer machine's IP.
The error is triggered on a single URL (a JSON-serving API) by a Java client using standards-compliant libraries.
Solution:
1. Try to set a rule allowing direct access to that URL.
2. Try setting security weaker and weaker ("essentially off").
3. If both fail, try whitelisting your IP.
4. Set up an alternate non-Cloudflare URL (direct.domain.com).
These will, of course, only work if you can negotiate with the site owners.
Backup solution: use an embedded browser that you can frame and remote-control (or a testing framework that does the same through a plugin) and extract the content from there, if you can.
Hope this helps.
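As a minimal sketch of the user-agent angle (Python here, with a hypothetical API URL): Cloudflare frequently challenges requests whose User-Agent is a default library string, so sending a browser-like one is a common first step. Note that a spoofed User-Agent alone often will not get past Cloudflare; whitelisting, as above, is the reliable fix.

```python
import urllib.request

# Hypothetical endpoint; replace with the real API URL.
req = urllib.request.Request(
    "https://api.example.com/data.json",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "application/json",
    },
)

# urllib normalizes header names to "Xxxx-yyyy" capitalization.
print(req.get_header("User-agent"))  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
# urllib.request.urlopen(req) would perform the actual fetch.
```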
You're probably triggering one of our security features by trying to scrape a site on us. The only option, really, would be to ask the site owner to whitelist your IP(s) to override the behavior.

Google is indexing my AWS Elastic Beanstalk site (mysite.elasticbeanstalk.com) but not my site (mysite.com). What should I do?

I was testing my site on AWS at mysitetest.elasticbeanstalk.com, but my original site is mysite.com. Now, whenever I search for mysite, Google shows links to mysitetest.elasticbeanstalk.com but not my original site. I have done all the verifications in Webmaster Tools for my site.
Is there any way to make the Elastic Beanstalk site completely private, so that it is invisible to Google? Any other suggestions are welcome too.
You should set up a robots.txt file on your test site telling Google and other crawlers not to index it, and add a 301 redirect from the test site to the production site.
Example article: http://www.bruceclay.com/blog/how-to-properly-implement-a-301-redirect/
Your future test sites should have a robots.txt that tells Google not to crawl them.
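A minimal robots.txt for the test site that asks all crawlers to stay away looks like this (note that robots.txt is advisory only; for real privacy, restrict access by IP or authentication):

```text
User-agent: *
Disallow: /
```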

How to Allow Only Google, MSN/Yahoo bot access in .htaccess

I need help to allow only the Google bot and the Yahoo/MSN bots access to my site through .htaccess. Any help greatly appreciated.
For Google I got this, but I'm not sure it's right:
Allow from googlebot.com google.com google-analytics.com
Satisfy Any
I think your reasons for doing this are probably questionable, but the only way to really do this is by the reported User-Agent (an HTTP request header), not by domain; and the reported User-Agent can easily be spoofed by anyone. (This is also usually controlled through robots.txt, but typically for the opposite purpose: restricting crawlers, not normal users.) The servers that Google and others use to crawl sites won't have the same names or IPs as the domains you listed.
For Google, some additional and official details of this are available at http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1061943 . Yahoo and MSN will have similar pages.
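With the caveat above (User-Agent strings are trivially spoofed), a rough .htaccess sketch using mod_rewrite might look like this; the bot-name patterns are common ones and are assumptions, adjust to taste:

```apache
# Hypothetical: return 403 Forbidden unless the client claims to be
# Googlebot, Bingbot/MSNBot, or Yahoo! Slurp. Easily spoofed!
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|msnbot|Slurp) [NC]
RewriteRule ^ - [F]
```

For stronger verification, Google documents a reverse-DNS check on the crawler's IP (see the support link above).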
