404 handler and dynamic pages that really don't exist... bad for SEO? - iis

We have an IIS 404 ASP.NET handler that renders pages when an HTML page is not found. It uses the requested URL to query our databases and builds rich, relevant content on the fly. From what I can tell from the IIS logs and from analyzing the pages with browser tools, there is NO indication that the page does not actually exist and was dynamically generated.
In these cases, is IIS actually sending a 404 to the client? Is any kind of redirect happening? Will search engines punish me for this?
It's been two months and Google has indexed everything, but Bing and Yahoo have not indexed anything dynamic despite my submitting various directory pages, sitemaps and feeds with all my links. My home page is indexed on all search engines and contains all my links. When I search for very unique keywords in those links, I can see that Bing and Yahoo do see them in my home page links - but only there.
Is there anything I can run or check to make sure my dynamic pages are not somehow viewed as bad by Search engines? Any way to check if a 404 (whatever a 404 actually is to a client besides just another page) is returned to crawlers?
Many Thanks.

Is there anything I can run or check to make sure my dynamic pages are
not somehow viewed as bad by Search engines?
Dynamic pages are just fine. Most of the content on the Internet is dynamically produced. Search engines don't care whether content is dynamic and, in fact, they usually cannot even tell, as all they see is the URL and the HTML that URL produces.
Any way to check if a 404 (whatever a 404 actually is to a client
besides just another page) is returned to crawlers?
Use a tool like Firebug or the built-in developer tools in Chrome to view your HTTP headers. Crawlers see the same headers a browser would see, so that is an easy way to tell what status code and headers your pages are sending out.
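If you would rather script the check, here is a minimal sketch assuming Node 18+ (for the global fetch API); the URL is just a placeholder for one of your dynamically generated pages:

// checkStatus.ts - minimal sketch; assumes Node 18+ so the global fetch API is available.
// The URL passed below is a placeholder for one of your dynamically generated pages.
async function checkStatus(url: string): Promise<void> {
  const res = await fetch(url, { redirect: "manual" }); // "manual" surfaces 3xx responses instead of following them
  console.log("Status:", res.status); // 200 here means crawlers see a normal page, not a 404
  res.headers.forEach((value, name) => console.log(`${name}: ${value}`));
}
checkStatus("https://www.example.com/some-dynamic-page").catch(console.error);

If a page that "does not exist" comes back as 200 here, crawlers see it as a normal page rather than a 404 (a so-called soft 404).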

Related

URL Rewrite IIS and search engine

I've configured my IIS (asp.net site) to use URL Rewrite.
In particular, this is my rule (a dynamic one): any URL in the format number/string is redirected to a special aspx page.
So any URL of the form mysite/id/Name is redirected to showprof.aspx?id=id&title=Name. This works perfectly.
My question is about search engines. I don't have any "fixed" page that contains links like mysite/id/Name that a spider can scan, so I'm trying to figure out how search engines could index my dynamic pages. Should I create a sitemap.xml? If so, in which way? Or should I create a "hidden" page that contains links to all my dynamic content, like mysite/id1/Name1, mysite/id2/Name2, and so on?
thank you
A starting point is definitely a sitemap.xml. You could try, for example, the IIS SEO Toolkit and see whether it is able to index any of your pages: http://www.iis.net/downloads/microsoft/search-engine-optimization-toolkit
It also has functionality to generate a sitemap.xml, although I'm guessing that in your case you have dynamic content, so a better approach would be a "handler" that generates the sitemap dynamically on demand (and perhaps caches it for performance), as sketched below.
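As an illustration only, here is a minimal sketch of such a handler in Node/TypeScript (your site is ASP.NET, so the real version would be an ASP.NET handler instead; Express, the route, and getAllProfiles() are assumptions standing in for your own framework and database query):

// sitemap.ts - minimal sketch of a dynamically generated sitemap.xml endpoint.
// Assumes Express is installed; getAllProfiles() is a hypothetical stand-in for your database query.
import express from "express";

interface Profile { id: number; name: string; }

async function getAllProfiles(): Promise<Profile[]> {
  // placeholder - replace with a real query against your database
  return [{ id: 1, name: "Name1" }, { id: 2, name: "Name2" }];
}

const app = express();

app.get("/sitemap.xml", async (_req, res) => {
  const profiles = await getAllProfiles();
  const urls = profiles
    .map(p => `  <url><loc>https://www.mysite.com/${p.id}/${encodeURIComponent(p.name)}</loc></url>`)
    .join("\n");
  const xml = `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>`;
  res.type("application/xml").send(xml); // caching this output is a good idea for larger sites
});

app.listen(3000);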
I would also recommend having some pages that are accessible through normal links. For example, your home page could link to a "site map" page (not sitemap.xml) where you render the set of links you want indexed (at least the ones most important to you); that will make them easy to discover.

site: Google search does not show all results

If I go to this url
http://sppp.rajasthan.gov.in/robots.txt
I get
User-Agent: *
Disallow:
Allow: /
That means crawlers are allowed full access to the website and can index everything, so why does site:sppp.rajasthan.gov.in on Google search show me only a few pages, when the site contains lots of documents, including PDF files?
There could be a lot of reasons for that.
You don't need a robots.txt just to allow crawling across the board; everything is allowed by default.
http://www.robotstxt.org/robotstxt.html doesn't allow blank Disallow lines:
Also, you may not have blank lines in a record, as they are used to delimit multiple records.
Check Google Webmaster Tools to see if some pages have been disallowed for crawling.
Submit a sitemap to Google.
Use "Fetch as Google" to see whether Google can even see the site properly.
Try manually submitting a link through the Fetch as Google interface.
Looking closer at it.
Google doesn't know how to navigate some of the links on the site. Specifically, on http://sppp.rajasthan.gov.in/bidlist.php the bottom navigation uses onclick JavaScript that is loaded dynamically and does not change the URL, so Google couldn't link to page 2 even if it wanted to.
On the bid list you can click through to a page detailing each tender. These don't have public URLs, so Google has no way of linking to them.
The PDFs I looked at were image scans in Sanskrit put into PDF documents. While Google does OCR PDF documents (http://googlewebmastercentral.blogspot.sg/2011/09/pdfs-in-google-search-results.html), it's possible they can't do it with Sanskrit. You'd be more likely to find them if they contained real text rather than images.
My original points remain though. Google should be able to find http://sppp.rajasthan.gov.in/sppp/upload/documents/5_GFAR.pdf which is on the http://sppp.rajasthan.gov.in/actrulesprocedures.php page. If you have a question about why a specific page might be missing, I'll try to answer it.
But basically the website does some bizarre, non-standard things, and this is exactly what you need a sitemap for. Contrary to popular belief, sitemaps are not for SEO; they are for when Google can't locate your pages.

How to check whether a site has a custom 404 page or a default one?

I am creating an SEO audit tool using NodeJS. I want to check whether a URL has a custom 404 page set up or not. How can I check?
I have analysed the responses for both a custom 404 page and a default one; both return the same content-type and response headers. Both return HTML content only, so how can I decide whether it is a custom 404 page or not?
If this is very important for you to know (maybe you are selling custom 404 pages), you'll need to examine the HTML returned by the request.
Many popular servers, such as Tomcat, IIS, and Apache, return a standard 404 page that you should be able to recognize. The same goes for frameworks such as Django or Rails. You could build some logic that compares 404 results against the "fingerprints" of a known population of default 404 pages.
For example, certain versions of Tomcat have a title on their error pages that looks like this:
<title>Apache Tomcat/7.0.50 - Error report</title>
If you see something that looks like that, you can be pretty sure you are dealing with the default Tomcat error page.
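A minimal sketch of that fingerprint approach in Node/TypeScript, assuming Node 18+ for the global fetch; the fingerprint patterns and the probe path are illustrative guesses, not an exhaustive signature library:

// detect404.ts - minimal sketch; the fingerprints and probe path below are illustrative only.
const DEFAULT_404_FINGERPRINTS = [
  /<title>Apache Tomcat\/[\d.]+ - Error report<\/title>/i, // default Tomcat error page
  /The requested URL .* was not found on this server/i,    // default Apache error page
  /<title>404 - File or directory not found\.<\/title>/i,  // default IIS error page
];

async function hasCustom404(siteUrl: string): Promise<boolean> {
  // Request a URL that is very unlikely to exist so we get the site's 404 page.
  const res = await fetch(new URL("/this-page-should-not-exist-" + Date.now(), siteUrl));
  const html = await res.text();
  const looksDefault = DEFAULT_404_FINGERPRINTS.some(re => re.test(html));
  return res.status === 404 && !looksDefault;
}

hasCustom404("https://www.example.com").then(custom =>
  console.log(custom ? "Custom 404 page" : "Default (or soft) 404 page"));

Probing a deliberately non-existent URL also reveals "soft 404s", where the server answers 200 instead of 404.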
There are machine learning techniques that can probably do this for you without needing to compile a library of 404 page fingerprints (similar to filters that distinguish spam messages from legit ones).

If a page is not linked to the main website, can search engines find it?

I want to put a secret page on my website (www.mywebsite.com). The page URL is "www.mywebsite.com/mysecretpage".
If there is no clickable link to this secret page in the home page (www.mywebsite.com), can search engines still find it?
If you want to hide a page from web crawlers: http://www.robotstxt.org/robotstxt.html
A web crawler collects links and follows them. So if you're not linking to the page, and no one else is, the page won't be found by any search engine.
But you can't be sure that someone looking for your page won't find it. If you want to keep data secret, you should use some kind of access control that grants access only to those who should have it.
Here is a more useful link: http://www.seomoz.org/blog/12-ways-to-keep-your-content-hidden-from-the-search-engines
No. A web spider crawls based on links from previously discovered pages. If no page links to it, search engines won't be able to find it.

Webmaster Tools Crawler 403 errors

Google Webmaster Tools is reporting 403 errors for some folders on the website's server, for example:
http://www.philaletheians.co.uk/Study%20notes/
The folder isn't forbidden, so I don't understand why Google's crawler would get 403 errors.
How come the Google crawler is trying to browse the actual folders rather than going straight to the files in those folders? Is this something to do with robots.txt?
Make sure there is actually something to serve when someone requests that URL. I've browsed through your site and could not find a link that points to http://www.philaletheians.co.uk/Study%20notes/
Also, it seems all the study notes are inside this "Study%20notes" directory, so that bare directory link will not work anyway. Check the Google Webmaster Tools "linked from" report to find where this broken link originates and fix it.
Have you set the default document correctly in your web server? In Apache, this is the DirectoryIndex setting (which defaults to index.html). Also, in general it is better to strip spaces and similar characters from your traversable directory names (the %20 you see between "Study" and "notes" is a URL-encoded space character), so as to keep your URLs clean for your visitors and for search engine bots.
