Background
I'm currently developing a website, simplycaravan.com, and am struggling to get Google to index it properly. We have many (potentially hundreds) of similar but unique pages that display information about different holiday homes. The URLs for these pages all begin with "http://simplycaravan.com/caravan-details/?id=", followed by a unique id number that distinguishes each page, e.g. "http://simplycaravan.com/caravan-details/?id=55". The problem is that Google does not see each of these pages as unique and therefore does not index them.
Question
Each of these 'id=' pages has the below tag in the header
<link rel="canonical" href="http://simplycaravan.com/caravan-details/?id=55" />
Can anyone advise whether this tag should be used, or whether it is causing a conflict that leads to the problem? My understanding is that a canonical tag should be used if you have multiple similar pages and you want to redirect them to one page. This is not the case for our website: we do not want to redirect all these apparently similar pages to one page; we want them to be standalone, unique webpages in their own right.
One thing to note is we have changed the id parameter in Google search console from 'let google decide' to 'every url'.
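One way to check whether each page's canonical tag points back at the page itself (rather than at a single shared URL) is sketched below. This is only an illustration using Python's standard `html.parser`; the helper name `is_self_canonical` is hypothetical, and the HTML string stands in for a fetched page.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> found in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are also routed here by HTMLParser.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel and attr.get("href"):
            self.canonicals.append(attr["href"])


def is_self_canonical(page_url, html):
    """True if the page declares exactly one canonical URL and it is its own URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals == [page_url]


page_url = "http://simplycaravan.com/caravan-details/?id=55"
html = '<head><link rel="canonical" href="' + page_url + '" /></head>'
print(is_self_canonical(page_url, html))
```

If every 'id=' page passes this check with its own URL, the tag is self-referencing and does not ask search engines to fold the pages into one.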
Thanks in advance for any help offered.
This question was migrated from Stack Overflow because it can be answered on Webmasters Stack Exchange.
Migrated 20 days ago.
How do I tell search engine crawlers not to check a checkbox when indexing my site? I want to do something like this:
<input type="checkbox" rel="nofollow" />
, but the rel attribute is not listed in the list of attributes here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input, which makes sense, because this isn't a link. But I am not sure how to tell search engines that they shouldn't check this when filling out the form in that case. If I include this rel="nofollow" attribute here, will search engines comply anyway, even though it is not valid?
You can block Googlebot from crawling your forms by excluding them in your robots.txt file.
It seems that Google only follows GET forms. So make it a POST.
That means that if a search form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate. Similarly, we only retrieve GET forms and avoid forms that require any kind of user information. For example, we omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc. We are also mindful of the impact we can have on web sites and limit ourselves to a very small number of fetches for a given site.
From developers.google
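As a rough illustration of the robots.txt approach, the sketch below uses Python's standard `urllib.robotparser` to confirm that a rule blocks the URLs a search form would generate. The `/search` path and the `example.com` domain are assumptions made for the example, not taken from the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers away from the search form's URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawlable(url, user_agent="Googlebot"):
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable("http://example.com/search?q=caravan"))
print(is_crawlable("http://example.com/about"))
```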
For indexing different search results, I suggest the following approaches
Use different params or urls for different search results and put them in the sitemap and refer to them on a sacrificial or site-map page.
Detect that a robot is crawling and make some inputs read-only and autofill others to achieve different search results.
Currently Google displays elements in the result excerpts that belong to the functional part of the site. Is there a way to exclude these elements from being crawled/displayed in Google?
Like eEdit, eDelete, etc in the example above.
To exclude the pages from Google's index, block them using the robots.txt file or, if it is just the content, use the rel="nofollow" attribute.
Hope this helps.
Update on my particular situation here: I just found out that the frontend code had been generated in a way where the title and the description meta were identical.
Google is smart enough to decide that if a piece of copy is already displayed in the title of the search result, there's no reason to add it to the excerpt as well; instead it looks for content, believed to be valuable, from the actual page.
Lessons learned:
there's no way to hide elements from google but keep it visible for your users
if you'd like to have control over the content displayed in google searches, avoid using the same copy in your title and description
We are using Liferay as a classic CMS meaning that we compose pages using web content articles. There is an issue with Liferay's internal search I could not yet find a proper answer for:
Because web content articles are pretty much only building blocks for pages we don't want the search to show them as distinct items. The user should only get a list of pages that contain their search keywords, including all the articles put onto this page.
At the moment we can see two different approaches and both come with certain problems we could not solve yet:
Idea 1
We modify the journal indexer and try to obtain the URLs of all pages (how?) on which the article has been placed. Then we add them to the document to be indexed. In the search result we can then access the URLs and collect them. In the end we make sure every URL is only shown once.
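The final step of Idea 1 (collect the page URLs from the hits and show each only once) could look roughly like the sketch below. It is written in plain Python purely to illustrate the logic and does not use any Liferay API; the `page_urls` field is a hypothetical name for whatever the modified indexer would store on each document.

```python
def pages_for_results(hits):
    """Flatten the page URLs of all hits, keeping first-seen order and dropping repeats."""
    seen = set()
    pages = []
    for hit in hits:
        # "page_urls" is an assumed field added by the custom indexer.
        for url in hit.get("page_urls", []):
            if url not in seen:
                seen.add(url)
                pages.append(url)
    return pages


hits = [
    {"article": "intro", "page_urls": ["/home", "/about"]},
    {"article": "teaser", "page_urls": ["/home"]},
    {"article": "orphan"},  # article not placed on any page
]
print(pages_for_results(hits))
```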
Idea 2
At some point Liferay renders the entire page before sending it to the browser. If we somehow could put an indexer there, we could index the entire page. We then could limit the search to the special "page documents". Getting the fully rendered page would be the main issue here, because either we would have to run a crawler to frequently trigger this indexing or we would need to find a way to trigger page rendering from within an indexer or something like that.
I have been carrying this problem around for quite a while now and still could not find an idea good enough to spend time trying it out. If anyone of you has some input on those two ideas or maybe an entirely different approach, I would be extremely grateful.
I'll just answer myself, because by now we found a suitable solution to solve our problem:
In addition to the default search portlet there is also a "Web Content Search Portlet" shipped with Liferay. It seems to have been part of Liferay for quite a while now, but it's somewhat hard to find, because there is hardly any documentation for it (I only found the Liferay wiki page, which says next to nothing). It searches only within web content articles and shows links to the pages rather than just a link to an isolated view of the article. It has far fewer configuration options than the default search portlet, however. Pretty much all it lets you change is whether articles actually have to be placed on at least one page to show up in the results.
So there is no need for any kind of custom indexer or any other "hack"...all we need to do is use the correct portlet. We will only need to write a hook that changes the appearance of the result page.
What you ask is interesting, but your ideas go in the wrong direction.
Idea 2 especially is wrong, because you cannot do indexing work while a page is being rendered. Think about performance alone.
In Liferay, pages and assets are not directly linked: pages have portlets, and portlets display assets (web content and more).
Liferay indexing reads and scans the content of assets, not the rendered display of those assets. Think about permissions: the same page can display different content depending on the user who is looking at it.
bye
I've read a bit on the matter of friendly urls and I'm a little unsure as to what is better.
I currently have my website using a structure of http://www.domain.com/page.php?id=2
I am using the record id to determine the content of the page. My record ids are numeric and increment for each new page added. The content of existing pages can change completely over time but still use the same record id (this is a CMS, so the client may do this).
The way I understand it I have two options for friendly urls:
http://www.domain.com/page/2
http://www.domain.com/some-text-describing-the-page
Now because I identify the content by the record id, I would assume the first option would make more sense.
My client seems to want option two.
After some reading I found two conflicting points.
Tim Berners-Lee (the architect of the WWW) states that you want a URI which has the potential to remain the same 2 months, 2 years, 200 years from now. So you DO NOT want to use a page title or something similar for your pages. If you change your page's content you are either forced to change the content and leave the URI alone, or change the URI and be stuck with dangling links. You can read his article here (http://www.w3.org/Provider/Style/URI)
However, a number of other people on the internet (with no authority known to me) clearly state that you need to have a descriptive yet short URI for the best SEO value. From what I read, this is mostly for the purpose of backlinks and having keywords in the anchor text, since people often just use the link itself as the anchor text. So having keywords in the link itself helps search engines know what the link is about without a custom title.
It seems to me the difference has to do with long term VS short term.
Am I grasping this correctly?
If I am to use a slug-style URI as defined by the user, do I just have to allow my user to type whatever they want into a field and check it against the current database to see if it exists? If so, am I supposed to anticipate static links by running a query for the known record id and then use the result to generate the URL, which would just be rewritten back to the format: http://www.domain.com/page.php?id=2?
It seems to me that would be a lot of extra overhead.
I would suggest something in the middle of those two:
http://www.domain.com/page/2/some-text-describing-the-page
or without page:
http://www.domain.com/2/some-text-describing-the-page
You can still get the page Id from the Url, and there is a title as well! And, even more importantly, you're still able to get the correct content even when the page title changes later.
Consider a situation like this: a user creates a page, it receives Id=3, and its title is My great title. From that information the Url is generated, e.g. http://www.domain.com/page/3/my-great-title. After 2 months the user changes the title to This title is better then the last one!. The Url changes as well, to http://www.domain.com/page/3/this-title-is-better-then-the-last-one. However, there is still a 3 in the Url, so you're able to show the right content! You can also check whether the rest of the Url is current and, if it is not, redirect (a 301 would be best) to the new one to let search engines know that the Url changed.
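The lookup-then-redirect logic described above can be sketched as follows. This is a minimal illustration in Python rather than the PHP the site uses; `slugify`, `canonical_path`, `resolve`, and the in-memory `titles_by_id` dictionary (standing in for the database) are all hypothetical names.

```python
import re


def slugify(title):
    """Lower-case the title and join its alphanumeric runs with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def canonical_path(page_id, title):
    """Build the current /page/<id>/<slug> path for a record."""
    return f"/page/{page_id}/{slugify(title)}"


def resolve(path, titles_by_id):
    """Return (200, page_id), (301, new_path) or (404, None) for a request path."""
    match = re.fullmatch(r"/page/(\d+)(?:/([^/]*))?/?", path)
    if not match:
        return 404, None
    page_id = int(match.group(1))
    if page_id not in titles_by_id:
        return 404, None
    # Only the id selects the content; the slug is checked for freshness.
    expected = canonical_path(page_id, titles_by_id[page_id])
    if path != expected:
        return 301, expected
    return 200, page_id


titles_by_id = {3: "This title is better then the last one!"}
print(resolve("/page/3/my-great-title", titles_by_id))
print(resolve("/page/3/this-title-is-better-then-the-last-one", titles_by_id))
```

Because only the numeric id is used for the lookup, old links keep working after a title change, and the 301 tells search engines which form of the Url is current.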
Looking through my search logs from time to time, I notice that by far the biggest user of my search engine is the google-bot. What gives? Is it looking for content that might not be directly accessible through navigation? If so, how does it know which words and phrases to look for (they're surprisingly relevant). Does it check the most popular keywords on the site? I know I seem to be answering my own question here, but this is really only working it out from first principles. I'd like to hear from someone who knows what they're talking about (i.e. not me).
If your search form's method is get instead of post, each search has its own url, and people might be posting those urls elsewhere. Or if you have a (possibly inadvertently) publicly accessible webstats page that listed those urls, that's another common way for search engines to stumble upon your internal search urls. A third way I've seen is sites that list recent searches on their pages, but this is more intentional. "MySQL Performance Blog" does this to an annoying extent, so any search of their site from google yields hundreds of pages of similar searches, even if none of them found what they were looking for.
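To illustrate why a GET search form gives every search its own link, here is a small sketch of how the form fields end up in the URL. The `example.com` address and the `q` parameter name are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(base, query):
    """Build the URL a GET form with a single 'q' field would request."""
    return f"{base}?{urlencode({'q': query})}"


url = search_url("http://example.com/search", "static caravan")
print(url)                            # a shareable, crawlable address
print(parse_qs(urlsplit(url).query))  # the search terms recovered from it
```

A POST form sends the same fields in the request body instead, so there is no distinct URL for anyone to link to or for a crawler to follow.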
Edit: Looks like it does on occasion, but only GET forms:
http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html
Google will use words that occur on your site in search boxes to try to find pages that it can't otherwise.
Google says that for the past few months, it has been filling in forms
on a "small number" of "high-quality" web sites to get back
information. What words has it been entering into those forms? Words
automatically selected that occur on the site, with check boxes and
drop-down menus also being selected.
http://searchengineland.com/google-now-fills-out-forms-crawls-results-13760