how stop indexing links with included subfolders - .htaccess

My real links on my website should index in google as (example):
www.mywebsite.com/title,id,sometext,sometext
unexpectedly google search indexing my website with subfolders whitch should not occur for example:
www.mywebsite.com/include/title,id,sometext,sometext
www.mywebsite.com/img/min/title,id,sometext,sometext
and so on
how can i stop these actions from indexing. What i have to change on htaccess or robots.txt? Help me, thanks

You need to update your robots.txt to prevent bots from browsing those pages and you should set a noindex on these pages to remove them from rankings. You may also want to explore canonical links if the same page can be served from multiple URLs.

Related

Effect of robots.txt

I understand that naming a file to disallow in robots.txt will stop well behaved crawlers from scanning that file's content, but does it (also) stop the file being listed as a search result?
No, both Google and Bing will not stop indexing the file just because it appears in robots.txt:
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
https://developers.google.com/search/docs/advanced/robots/intro
It is important to understand that this not by definition implies that a page that is not crawled also will not be indexed. To see how to prevent a page from being indexed see this topic.
https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec

redirect numerous dynamic urls to home page via .htaccess

I am trying to clean up a previously hacked WordPress site, and domain name reputation, the site has new hosting and is now on a different CMS system, but there are hundreds of spam links in Google I need to get rid of, they look like example.com/votes.php?10054nzwzm75042pw205039
Domain name, then votes.php?**** etc.. Numbers letters all sorts.
So how do I redirect ANYTHING that starts with the domain name then /votes.php?***
Any help greatly appreciated
Unless you have multiple domains, you don't need to explicitly check the domain name.
To send a "410 Gone" for anything that contains /votes.php in the URL-path (and any query string), you can do something like the following at the top of your root .htaccess file using mod_rewrite:
RewriteEngine On
# Serve a 410 Gone for any requests to "/votes.php"
RewriteRule ^votes\.php$ - [G]
A 410 is preferable to a "redirect" if you want to get these URLs removed from the search engines as quickly as possible.
To expedite the process of URL removal from Google then use Google's Removal Tool as well.
If you redirect these pages to the homepage then it will likely be seen as a soft-404 by Google and these URLs are likely to remain in the search results for a lot longer.

duplicate URLs in my page, best solution?

I have a website that write URLs like this:
mypage.com/post/3453/post-title-name-person
In fact, what is important is the post and ID part (3453). The title I just add for SEO.
I changed some title names recently, but people can still using the old URL to access, because I just get the ID to open the page, so:
mypage.com/post/3453/post-title-name-person
mypage.com/post/3453/name-person
...
Will open the same page.
Is it wrong? Google webmaster tools tells me that I have 8765 duplications pages. So, to try to solve this I am redirecting old title to post/id/current-title but it seems that Google doesn't understand this redirecting and still give me duplications.
Should i redirect to not found if title doesn't match with the actual data base? (But this can be a problem because links that people shared won't open) Or what?
Maybe Google has not processed your redirections yet. It may take several weeks and sometimes several months to process all pages, especially if they are not revisited often. Make sure your redirects are 301 and not 302 (temporary).
That being said, there is a better method than redirections for duplicate pages: the canonical tag. If you can, implement it. There is less risk to mix up redirections.
Google can pick your new URL's only after the implementation of 301 redirection through .htaccess file. You should always need to remember that 301 re-direct should be proper and one to one to the new url. After this implementation you need to fetch those new URL via Google Search console so that Google index those URL's fast.

Block Bots from crawling one of my sites on a multistore multidomain prestashop

Hello i have a multistore multidomain prestashop installation with main domain example.com and i want to block all bots from crawling a subdomain site subdomain.example.com made for resellers where they can buy at lower prices because the content is duplicate to the original site, and i am not exacly sure how to do it. Usualy if i want to block the bots for a site i would use
User-agent: *
Disallow: /
But how do i use it without hurting the whole store ? and is it possible to block the bots from the htacces too ?
Regarding your first question:
If you don't want search engines to gain access to the subdomain (sub.example.com/robots.txt), using a robots.txt file ON the subdomain is the way to go. Don't put it on your regular domain (example.com/robots.txt) - see Robots.txt reference guide.
Additionally, I would verify both domains in Google Search Console. There you can monitor and control the indexation of the subdomain and main domain.
Regarding your second question:
I've found a SO thread here which explains what you want to know: Block all bots/crawlers/spiders for a special directory with htaccess.
We use a canonical URL to tell the search engines where to find the original content.
https://yoast.com/rel-canonical/
A canonical URL allows you to tell search engines that certain similar
URLs are actually one and the same. Sometimes you have products or
content that is accessible under multiple URLs, or even on multiple
websites. Using a canonical URL (an HTML link tag with attribute
rel=canonical) these can exist without harming your rankings.

Why google finds a page excluded by robots.txt?

i'm using robots.txt to exclude some pages from spiders.
User-agent: *
Disallow: /track.php
When i search something refeered to this page, google says: "A description for this result is not available because of this site's robots.txt – learn more."
It means that the robots.txt is working.. but why the link to the page is still found by the spider? I'd like to have no link to the 'track.php' page... how i should setup the robots.txt? (or something like .htaccess and so on..?)
Here's what happened:
Googlebot saw, on some other page, a link to track.php. Let's call that page "source.html".
Googlebot tried to visit your track.php file.
Your robots.txt told Googlebot not to read the file.
So Google knows that source.html links to track.php, but it doesn't know what track.php contains. You didn't tell Google not to index track.php; you told Googlebot not to read and index the data inside track.php.
As Google's documentation says:
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
There's not a lot you can do about this. For your own pages, you can use the x-robots-tag or noindex meta tag as described in that documentation. That will prevent Googlebot from indexing the URL if it finds a link in your pages. But if some page that you don't control links to that track.php file, then Google is quite likely to index it.

Resources