robots.txt - disallow page without querystring [closed]

I have a page that serves up dynamic content
/for-sale
the page should always have at least one parameter
/for-sale?id=1
I'd like to disallow
/for-sale
but allow
/for-sale?id=*
without affecting bots' ability to crawl the rest of the site, and without risking a negative impact on SERPs.
Is this possible?

What you want does not work using robots.txt:
There is no Allow: directive in the original robot exclusion standard, although the draft RFC written by Martijn Koster suggests one (and some crawlers seem to support it).
Query strings and wildcards are not supported either, so disallowing the "naked" version will disallow everything under that path. Surely not what you want.
Anything in robots.txt is entirely optional and merely a hint. No robot is required to request the file at all, or to respect anything you say in it.
You will almost certainly find one or several web crawlers for which any or all of the above is wrong, and you have no way of knowing.
To address the actual problem, you could put a rewrite rule into your Apache configuration. There is readily available code for turning a URL with a query string into a normal URL (example from a quick web search).
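A minimal sketch of such a rule, assuming Apache with mod_rewrite enabled and an .htaccess file in the document root (the /for-sale path and id parameter come from the question; the clean-URL scheme itself is an assumption):
RewriteEngine On
# Map the clean URL /for-sale/123 internally to /for-sale?id=123,
# so the query string never has to appear in public URLs.
RewriteRule ^for-sale/([0-9]+)/?$ /for-sale?id=$1 [L,QSA]
If you then link to the clean /for-sale/123 form everywhere, the robots.txt problem disappears entirely.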
(Alternatively, you could just leave the id query string in place. The One Search Engine that makes up 85% of your traffic eats them just fine, and the other two that make up 90% of what is not Google do as well.
So your fear is really only about search engines that nobody uses, and about spam harvesters.)

This should work for crawlers, such as Googlebot, that support the non-standard Allow directive and * wildcards:
User-agent: *
Disallow: /for-sale
Allow: /for-sale?id=*
(The rules need a User-agent line to form a valid group. Under Google's longest-match rule, the more specific Allow wins for any URL carrying an id parameter, while the bare /for-sale stays disallowed.)

Related

SEO, content duplication and pagination [closed]

I have a layout similar to this, only it's not debates, so I'll use this as an example for the question.
As you can see, they have three different tabs, for rounds, comments, and votes, but these are all on one page. In my case, I have separate pages for comments and votes, like this:
example.com/post/1 <- main post's URL
example.com/post/1/comments
example.com/post/1/votes
and both comments and votes are paginated, so there can be URLs like this:
example.com/post/1/comments/page/3
So I wonder how I should manage this kind of situation from an SEO perspective. Won't the fixed part of the post above the tabs be considered duplication? And what happens if I add a canonical link on, say, the comments page, pointing to the main post's URL? Will the comments still be indexed, or only the main post's page?
Won't the fixed part of the post above the tabs be considered duplication?
No. If it is repeated on every page, it will be treated as boilerplate content and ignored for ranking, because it is not specific to the page itself.
And what happens if I add a canonical link on, say, the comments page, pointing to the main post's URL? Will the comments still be indexed, or only the main post's page?
If Google trusts and agrees with your canonical link, then only the main post will be used for indexing.
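For illustration, a canonical link is just a tag in the <head> of each paginated page (the URLs are the question's examples; the https scheme is an assumption):
<!-- in the <head> of example.com/post/1/comments/page/3 -->
<link rel="canonical" href="https://example.com/post/1">
With this in place, the paginated comment pages consolidate to the main post's URL rather than competing with it.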

Search Filters and SEO -- nofollow, canonical, or nothing? [closed]

I have an eCommerce site that I am working on. The primary way customers find products is by filtering down results with a search menu (options include department, brand, color, size, etc.).
The problem is that the menu creates a lot of duplicate content, which I am afraid will cause problems with search engines like Google and Bing. A product can be found on multiple pages, depending on which combination of filters is used.
My question is, what is the best way to handle the duplicate content?
As far as I can tell, I have a few options:
(1) Do nothing and let search engines cache everything.
(2) Use a canonical link tag in the header so search engines only cache departments.
(3) Put rel="nofollow" on the filter links (though, to be honest, I'm not sure how that works internally).
(4) Put noindex in the header of filtered pages.
Any light that can be shed on this would be great.
This is exactly what canonical URLs are for. Choose a primary URL for those pages and make that the canonical URL. This is usually one that isn't found using filters. This way the search engines know to display that URL in the search results. And if they find the filtered pages from incoming links they give credit to the canonical URL which helps its rankings.
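A sketch with hypothetical URLs: every filtered variant of a department page points back at the unfiltered page it belongs to:
<!-- in the <head> of example.com/shirts?brand=acme&color=red (hypothetical URL) -->
<link rel="canonical" href="https://example.com/shirts">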

SEO using .htaccess? And, how to redirect a fake subdirectory to an individual page? [closed]

Since a recent redesign of our website, we've noticed that the search rankings for certain pages have plummeted, as individual publications are no longer on their own pages but rather on publications.php?magazine=xx, where xx is a unique ID number for the publication.
Is there any way to use a .htaccess file to redirect fake subdirectories to the pages, i.e. visiting /publications/magazine-name takes you to publications.php?magazine=xx, and if so: would this even have an effect on their SEO?
If not, is there any other way you can make these URL query strings more search engine-friendly?
I'm only halfway there, but using mod_rewrite with something like:
RewriteRule ^advanced-lift-truck/?$ pub-automotive.php?mag=1 [NC,L]
can get me a URL that Google will understand and crawl.
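A more general version of the same idea, as a sketch: it assumes an .htaccess file at the site root and that publications.php can look a magazine up by its slug as well as by its numeric ID (if it can't, you need one rule per publication as above, or a RewriteMap):
RewriteEngine On
# /publications/magazine-name -> publications.php?magazine=magazine-name
RewriteRule ^publications/([a-z0-9-]+)/?$ publications.php?magazine=$1 [NC,L,QSA]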
Now, it's just a case of figuring out what I can do about each page effectively having the same "content", just with different CSS showing/hiding parts.
I'm investigating the following:
http://www.webdesignerdepot.com/2013/10/how-to-optimize-single-page-sites-for-search-engines/

Domain aliasing and SEO [closed]

I've had my photography site at photography.brianbattenfeld.com, but now it's becoming my primary income and I'm doing it pretty much full time, so my primary domain should be my photography portfolio.
I'm thinking about having brianbattenfeldphotography.com and/or brianbattenfeld.com be my new domain for photography.
So my questions are:
If I make brianbattenfeldphotography.com just an alias of photography.brianbattenfeld.com are there significant SEO or analytics issues I should be worried about?
Will one perform better than the other, or rank higher?
Does it make a difference which one people visit?
Do search engines generally acknowledge the alias as 'secondary' somehow, because it's not where the files are actually stored?
A lot of questions I know, but I'm just trying to figure out what impact this may have.
In general, when moving a site or just changing the domain (because that is what you're doing, changing from a subdomain to the primary one), do NOT create duplicate content.
Essentially, if you go to subdomain.domain.com and get the same site as www.domain.com without the URL changing, you have duplicate content.
What I would suggest is that you create a 301 (permanent) redirect from subdomain.domain.com to domain.com. That way, Google will transfer your rankings from the old URL to the new one. It can take some time, but it will happen.
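A sketch of that redirect in an Apache .htaccess, assuming brianbattenfeldphotography.com becomes the primary domain (the domain choice and the https scheme are assumptions; swap in whichever you settle on):
RewriteEngine On
# Permanently redirect every URL on the old subdomain to the new domain
RewriteCond %{HTTP_HOST} ^photography\.brianbattenfeld\.com$ [NC]
RewriteRule ^(.*)$ https://brianbattenfeldphotography.com/$1 [R=301,L]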
So to answer your questions:
Do not make an alias (that would create duplicate content)
They will perform differently, based on number of inbound links. They could also perform poorly, both of them, if Google sees it as duplicate content.
No difference to the visitors
It's not "secondary"; it is a separate page. On this, however, I feel I need to mention canonical URLs. They should only be used when two different URLs serve essentially the same content, either on the same or a different domain. Using canonical URLs here for every page is A) overkill and B) not a great idea; you might as well use a 301 redirect. You can read more about canonical URLs here: http://googlewebmastercentral.blogspot.se/2009/02/specify-your-canonical.html
Hopefully that answers your question.

Just curious (so I know how it works): how do search engines find websites (if no one knows them) and the folders in them? [closed]

The answer to the first question would be a link to the website from a page the search engine already crawls (i.e. a page it already knows). But even if you put a site at very_long_name_without_any_sense_123kni.com, I guess it will be found anyway.
The second question is about folders. If you have a robots.txt in your root directory, then it's somewhat clear. But if your website has no robots.txt, how will a search engine find all the folders it is allowed to access?
If a search engine knows your website but the site has no robots.txt, how long will it take to appear in the most popular search engines? Ten minutes? An hour? A day? A week? Never? How dangerous is it to leave pages that should be protected unprotected even for one minute, if your website has not been crawled yet (because it is protected)?
P.S. These questions are not about the steps to make your website popular and get it onto the first pages among others. I'm just curious about the principles of how it all works.
They can't, and don't.
That said, they can make some guesses based on known domain names (that information is accessible) and typical default website locations at those domain names.
