How to ignore/redirect all URLs matching a certain string - .htaccess

I am using the WordPress plugin Timely All-in-One Events Calendar. Unfortunately it is creating a plethora of duplicate URLs which end in strings like (https://www.mywebsite.com/events/action~agenda/page_offset~-2/request_format~json/cat_ids~4) or (https://www.mywebsite.com/events/action~oneday/exact_date~2-4-2019/), for example.
Because each of these URLs is for a different calendar view but carries the same webpage title and content, some search engines are seeing them as duplicate content. Whilst robots.txt is set up to tell bots to ignore the URLs containing said strings, some crawlers are ignoring robots.txt. I have also disabled the other calendar views so there is now only the agenda view, but in spite of this, bots continue to crawl these URLs.
Is it therefore possible to use an Apache .htaccess directive to tell the server to handle any request containing "/action~" by either removing the string from the URL, so the browser just reads "/events/", or redirecting the URL to another page?
There are over 500 of these URLs so I ideally would like a quick remedy!
Thanks in advance.

Check this rewrite in your .htaccess file
RewriteEngine On
RewriteRule ^events\/action(.*)$ /events/ [L,R=301]
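A slightly stricter variant of the rule above, as a sketch only. It assumes the .htaccess file sits in the document root and that the standard WordPress rewrite block ("# BEGIN WordPress") is in the same file. It matches the literal "action~" prefix from the question rather than anything starting with "action", and drops the unneeded backslash before the slash.

```apache
RewriteEngine On

# Redirect any calendar view URL such as /events/action~agenda/... to /events/.
# Place this above the "# BEGIN WordPress" block so it is evaluated before
# WordPress's own front-controller rewrite.
RewriteRule ^events/action~ /events/ [R=301,L]
```

It may be worth testing with R=302 first, since browsers cache 301 responses aggressively.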

Related

Self referencing canonical tag in htaccess?

RewriteCond %{REQUEST_URI} !^/assets/pub/pdf-docs/.*$ [NC]
I need all the PDF files in the assets/pub/pdf-docs folder to have a self-referencing canonical HTTP Link header.
How can I do this with one(ish) line(s) of code in the htaccess file?
I cannot apply it to just pdf files because the pdfs in assets/pvt/pdf-docs are excluded from indexing.
Many thanks
You could do it like this:
RewriteEngine On
# Set env var CANONICAL_URL
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^assets/pub/pdf-docs/.+\.pdf$ - [E=CANONICAL_URL:https://%{SERVER_NAME}%{REQUEST_URI}]
Header add Link '<%{CANONICAL_URL}e>; rel="canonical"' env=CANONICAL_URL
The mod_rewrite directives set an environment variable CANONICAL_URL if an existing .pdf file in the stated URL-path is requested. The Header directive then sets a rel="canonical" Link header, using this env var (ie. %{CANONICAL_URL}e), but only if this env var is set.
In order to retrieve the canonical hostname, this is dependent on either the hostname already being canonicalised (ie. www vs non-www etc.) prior to these directives OR UseCanonicalName On and ServerName is set appropriately in the server config (otherwise SERVER_NAME is simply the same as HTTP_HOST - the value of the HTTP Host header). If this is not the case then hardcode the canonical hostname in place of %{SERVER_NAME}.
Reference:
https://httpd.apache.org/docs/current/mod/mod_headers.html#header
https://httpd.apache.org/docs/current/mod/mod_rewrite.html#rewriterule
The issue is that Google is choosing old PDFs that return a 404 as canonical and ignoring the correct PDFs. The correct PDFs are in the sitemap.
HOWEVER, as I stated in comments, setting this self-referential canonical tag on the new PDF is not going to help you - it's not going to prevent the old PDFs (that return a 404) from appearing in the search results.
For that, you need to 301 (permanent) redirect the old PDFs to the new to inform search engines that the old PDFs have moved to a new URL (and to redirect users from the search engine's search results)*1. A separate sitemap containing only the "old" PDFs (that now redirect) can also help search engines with crawling the old URLs and discovering the redirect. Adding this sitemap to GSC will give you an idea of the index status of these old (and out of date) URLs.
*1 This is assuming that the PDFs have simply changed URL and are not entirely new and unrelated. If they are new and unrelated, a redirect would not be appropriate; you should instead serve a "410 Gone" and request that these old URLs be removed from the SERPs using Google's URL removal tool to expedite the removal process.
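Both cases can be sketched with mod_alias in .htaccess. The file names, the /old-docs/ path and the www.example.com hostname below are hypothetical placeholders, not taken from the question; substitute the real old and new URLs.

```apache
# PDF that has simply moved: permanent redirect from the old URL to the new one.
# (Hypothetical paths and hostname.)
Redirect 301 /old-docs/brochure.pdf https://www.example.com/assets/pub/pdf-docs/brochure.pdf

# PDF that was withdrawn with no replacement: respond "410 Gone" instead.
Redirect gone /old-docs/retired-report.pdf
```

One line per old PDF is the simplest form; if the old and new URLs follow a pattern, RedirectMatch could cover them with a single regex.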
(Adding a self-referential-canonical tag is only going to help if the same PDF is accessible from different URLs - but this is irrelevant to your current issue. And that isn't something you could necessarily do in .htaccess, unless you do it one-by-one for each PDF, or there is a discernible pattern that allows you to generate the canonical URL regardless of the request.)

.htaccess rewrite for orphaned URLs containing underscores and arguments

I only modify the .htaccess with great care for the purposes of my online store.
Some time ago, I did a website migration from osCommerce to OpenCart. This resulted in orphaned osCommerce-style URLs with these two example formats:
http://www.londonpower.com/catalog/product_info.php?products_id=75
http://www.londonpower.com/catalog/product_info.php?cPath=15&products_id=75
Lots of websites in internet-land have links to my old-style URLs, and I have about 100 of them, so I would like to redirect them to new URLs with the following format:
http://www.londonpower.com/2-channel-guitar-preamp
If I understand correctly, the problem has two parts:
to eliminate the underscores, as they baffle the .htaccess engine;
to then perform a 301 redirect on the URL.
So far, I have been able to get the first underscore to change to a hyphen, with this Rewrite Rule:
RewriteRule ^([^_]*)_(.*)$ /$1-$2 [R=301,L]
...but no luck with the second underscore (the one that is part of the query string after the "?"). I am stuck there.
I would avoid using rewriting for this. Does the file catalog/product_info.php exist in the new store? If not, create it and add a simple redirection using a map of old IDs to new URLs. If so, do the same thing in a different file, like old-redirector.php then rewrite requests to it.
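For comparison, if you would rather stay in .htaccess, here is a minimal sketch. Underscores are not special to mod_rewrite; the actual obstacle is that a RewriteRule pattern only sees the URL-path, so the query string has to be tested separately with RewriteCond %{QUERY_STRING}. The mapping of products_id=75 to /2-channel-guitar-preamp is assumed here for illustration, since the question does not state which ID belongs to which new URL.

```apache
RewriteEngine On

# Assumed mapping: old product 75 -> /2-channel-guitar-preamp
# Matches products_id=75 anywhere in the query string (with or without cPath).
RewriteCond %{QUERY_STRING} (?:^|&)products_id=75(?:&|$)
# The trailing "?" drops the old query string from the redirect target.
RewriteRule ^catalog/product_info\.php$ /2-channel-guitar-preamp? [R=301,L]
```

This needs one RewriteCond/RewriteRule pair per product, so with around 100 products the lookup table in a PHP file suggested above may well be easier to maintain.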

HTAccess Redirect using main parameter, ignore all others

Firstly, I know and understand how to redirect based on parameters :)
My issue is that I need to redirect all links based on the supplied MenuID parameter and ignore any other information in the query string, as not all parameters are used in each web page request, e.g. menuid=2738421; is New Products
http://www.domain.com/shop.php?menuid=2738421&menuref=Ja&menutitle=New+Products&limit=5&page=24
or,
http://www.domain.com/shop.php?menuid=2738421&menuref=Ja&menutitle=New+Products&limit=20&page=3
or,
http://www.domain.com/shop.php?menuid=2738421&menuref=Ja&page=12&limit=15
to
http://www.domain.com/new.html?page=x&limit=x
The reason for the redirection is that search-engines have indexed these pages and so I need to prettify the URLs.
Is it actually possible to create a fuzzy redirect criterion like this?
## 301 Redirects
# 301 Redirect 1 - works for this explicit URL, but need a partial result
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^new\.html$ http://www.monarchycatering.com/shop.php?menuid=2738421&menuref=Ja&menutitle=New+Products&limit=5&page=24 [R=301,NE,NC,L]
Any help gratefully taken, thank you in advance
Mark.
Sorry for the delay, but StackOverflow doesn't seem to have a way to flag answers that have been replied to and need my attention.
OK, if I understand you correctly, you have an incoming "reduced" semi-SEF URL out in the real world (produced by your links), such as
http://www.domain.com/new.html?limit=5&page=24
("real" SEF would be something like http://www.domain.com/new/limit/5/page/24.html)
and you need to use .htaccess to map it to real files and more Query String information:
http://www.domain.com/shop.php?menuid=2738421&menuref=Ja&menutitle=New+Products&limit=5&page=24
You want to detect new.html for example, and replace it by a fixed string shop.php?menuid=2738421&menuref=Ja&menutitle=New+Products&, using the [QSA] flag to append the existing Query String on to the end?
RewriteEngine On
RewriteRule ^new\.html$ /shop.php?menuid=2738421&menuref=Ja&menutitle=New+Products [QSA,L]
RewriteRule ^sale\.html$ /shop.php?menuid=32424&menuref=Ja&menutitle=Products+On+Sale [QSA,L]
...etc...
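With [QSA], Apache appends the original query string to the one in the substitution, joined with an &, so a request for new.html?limit=5&page=24 is rewritten to shop.php?menuid=2738421&menuref=Ja&menutitle=New+Products&limit=5&page=24. Be sure to test it both with and without a query string.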
P.S. It probably would have been cleaner to use "comment" to reply to my question, rather than adding another answer.
It's not clear to me what your starting point is and where you're trying to end up. Do you have "pretty" URLs that you want to convert into "non-pretty" Query Strings that your scripts can actually digest?
The reason for the redirection is that search-engines have indexed
these pages and so I need to prettify the URLs.
If the search engines have already indexed the Query String non-pretty version, they'll have to now re-index with pretty URLs. Ditto for all your customers' bookmarks.
If you want to generate "pretty" links within your site, and decode them (in .htaccess) upon re-entry to non-pretty Query Strings, that should work. Your customers' existing bookmarks should even continue to work, while the search engines will replace the non-pretty with the pretty URLs.
and thanks for the interest in my question...
I have rewritten parts of my website and Google still has references to the old MenuID parameter and shop.php configuration, but I have now rewritten the query to a prettier format, e.g.
http://www.domain.com/shop.php?menuid=2738421&menuref=Ja&menutitle=New+Products&limit=5&page=24
is now
http://www.domain.com/new.html?limit=5&page=24
The pages represent product categories, and so needed to be displayed in a more meaningful manner. Customer bookmarking is not an issue, as long as I can redirect the pages.
I hope that makes sense, best wishes,
Mark.
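The external redirect from the old indexed URLs to the new ones could be sketched as follows. This is an untested sketch under some assumptions: the .htaccess file is in the document root, menuid=2738421 corresponds to new.html as in the question, and both page and limit are present in the old URL (requests lacking either are left alone). The parameters may appear in any order, and anything else in the query string (menuref, menutitle) is ignored.

```apache
RewriteEngine On

# Only act on shop.php as requested by the client, not on new.html after it
# has been internally rewritten to shop.php (that would cause a redirect loop).
RewriteCond %{THE_REQUEST} \s/shop\.php\?
# The category is identified by menuid alone.
RewriteCond %{QUERY_STRING} (?:^|&)menuid=2738421(?:&|$)
# Capture page and limit wherever they occur: %1 = page, %2 = limit.
# This must be the last condition, because %N refers to the last matched RewriteCond.
RewriteCond %{QUERY_STRING} ^(?=(?:.*&)?page=(\d+))(?=(?:.*&)?limit=(\d+))
RewriteRule ^shop\.php$ /new.html?page=%1&limit=%2 [R=301,L]
```

Because the substitution supplies its own query string and [QSA] is not used, the old parameters are discarded from the redirect target. One such block would be needed per menuid.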

using mod_rewrite to create SEO friendly URLS

I've been searching Google for this but can't find a solution to my exact needs. Basically I've already got my URLs named how I like them, i.e. "http://mysite.com/blog/page1.php"
What I'm trying to achieve (if it's possible!) is to use rewrite to alter the existing URLs to: "http://mysite.com/blog/page1"
The problem I've come across is that the examples I've found will do this if the user enters "http://mysite.com/blog/page1" into the browser, which is great; however, I need it to work for the existing links in Google so as not to lose traffic, so that incoming URLs "http://mysite.com/blog/page1.php" are directed to "http://mysite.com/blog/page1".
The 1st example (Canonical URLs) at the following is pretty much what you want:
http://httpd.apache.org/docs/2.0/misc/rewriteguide.html#url
This should do the trick, rewriting requests without .php to have it, invisible to the user.
RewriteEngine On
RewriteRule ^/blog/([^.]+)$ /blog/$1.php
You will need to write a rewrite rule mapping each old URL to its new URL as a permanent redirect. This will let the search engine know that the new, SEO-friendly URLs are the ones to be used.
RewriteRule ^blog/page1\.php$ /blog/page1 [R=301,L]
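The two answers above can be combined so that every page under /blog/ is covered without listing each one. The following is a sketch that assumes the rules live in a .htaccess file in the document root (where the pattern is matched without a leading slash) and that the .php files remain on disk under /blog/.

```apache
RewriteEngine On

# 1) A client asked for /blog/<name>.php directly: redirect to /blog/<name>.
#    THE_REQUEST holds the original request line, so the internal rewrite
#    in step 2 cannot trigger this rule again and cause a loop.
RewriteCond %{THE_REQUEST} \s/blog/([^.?\s]+)\.php[?\s]
RewriteRule ^blog/ /blog/%1 [R=301,L]

# 2) A client asked for /blog/<name>: serve /blog/<name>.php internally,
#    but only if that file actually exists.
RewriteCond %{DOCUMENT_ROOT}/blog/$1.php -f
RewriteRule ^blog/([^.]+)$ /blog/$1.php [L]
```

Any query string on the old URL is carried over to the redirect target by default.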

Getting "mywebsite.org/" to resolve to "mywebsite.org/index.php"

At my work we have various web pages that, my boss feels, are being ranked lower than they should be because "mywebsite.org/category/" looks like a different URL to search engines than "mywebsite.org/category/index.php" does, even though they show the same file. I don't think it works this way but he's convinced. Maybe I'm wrong though. I have two questions:
How do i make it so that it will say "index.php" in the address bar of all subcategories?
Is this really how pagerank works?
Besides changing all the links everywhere, a simpler solution is to use a rewrite rule. Make sure it is a permanent redirect, or Google will keep using the old link (without index.php). How you do this exactly depends on your web server, but for Apache HTTPd it looks something like the example given below.
Yes. Or so I've heard. Very few people know for sure. But Google mentions this guideline (as "Be consistent"). Make sure to check out all of Google's Webmaster guidelines.
Apache config for rewrite rule:
# in the generic config
LoadModule rewrite_module modules/mod_rewrite.so
# in your virtual host
RewriteEngine On
# redirect everything that ends in a slash to the same, but with index.php added
RewriteRule ^(.*)/$ $1/index.php [R=301,L]
# or the other way around, as suggested
# RewriteRule ^(.*)/index\.php$ $1/ [R=301,L]
Adding this code to the top of every page should also work:
<?php
if (substr($_SERVER['REQUEST_URI'], -1) == '/') {
    $new_request_uri = $_SERVER['REQUEST_URI'].'index.php';
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: '.$new_request_uri);
    exit;
}
?>
You don't tell us if you're using straight PHP or some other framework, but for PHP, probably you just need to change all the links on your site to "mywebsite.org/category/index.php".
I think it's possible that this does affect your search engine rank. However, you would be better off using only "mywebsite.org/category" rather than adding "index.php" to each one.
Bottom line is that you need to make sure all your links in your website use one or the other. What actually gets shown in the address bar is unimportant.
A simple solution is to put in the <head> tag:
<link rel="canonical" href="http://mywebsite.org/category/" />
Then, no matter which page the search engine ends up on, it will know it is simply a different view of /category/
And for your second question: yes, it can affect your results, if Google thinks you are spamming. If it didn't, they wouldn't have added support for rel="canonical". Although I wouldn't be surprised if they treat somedir/index.* the same as somedir/
I'm not sure if /category/ and /category/index.php are considered two URLs for SEO, but there is a good chance that it will affect them, one way or another. There is nothing wrong with making a quick change just to be sure.
A few thoughts:
URLs
Rather than adding /index.php, you will be better off making it so there is no index.php on any of them, since the keyword 'index' is probably not what you want.
You can make a script that will check if the URL of the current page ends in index.php and remove it, then forward to the resulting URL.
For example, on one of my sites, I require the 'www.' for my domain (www.domain.com and domain.com are considered two URLs for search purposes, though not always), so I have a script that checks each page and, if there is no www., adds it and forwards.
if (APPLICATION_LIVE) {
    if (strtolower($_SERVER["HTTP_HOST"]) != "www.domain.com") {
        header("HTTP/1.1 301 Moved Permanently"); // Recognized by search engines and may count the link toward the correct URL...
        // The Location value needs a scheme to be an absolute URL; REQUEST_URI already starts with "/"
        header("Location: http://www.domain.com" . $_SERVER["REQUEST_URI"]);
        exit();
    }
}
You could modify that to do what you need.
That way, if a crawler visits the wrong URL, it will be notified that it was replaced with the correct URL. If a person visits the wrong URL, they will be forwarded to the correct URL (most won't notice), and then if they copy the url from the browser to send someone or link to that page, they will end up linking to the correct url for that page.
LINKING URLS
The way other pages link to your pages is more important for SEO. Make sure all your in-site links use the proper URL (without /index.php), and that if you have a 'link to this page' feature, it doesn't include the /index.php part. You can't control how everyone links to you, but you can take some control over it, like with the script in item 1.
URL ROUTING
You may also want to consider using some sort of framework or stand-alone URL rerouting scheme. It could make it so there were more keywords, etc.
See here for an example: http://docs.kohanaphp.com/general/routing
I agree with everyone who's saying to ditch the index.php. Please don't force your visitor to type index.php if not typing it could get them the same result.
You didn't say if you're on an IIS or Apache server.
IIS can be set to assume index.php is the default page so that http://mywebsite.org/ will resolve correctly without including index.php.
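On Apache the equivalent is the DirectoryIndex directive from mod_dir. The sketch below, which assumes a .htaccess file in the document root with mod_rewrite enabled, also redirects any URL that still ends in index.php to the bare directory URL, so only one form remains visible to search engines.

```apache
# Serve index.php when a directory URL such as /category/ is requested.
DirectoryIndex index.php

RewriteEngine On
# Only react to index.php as requested by the client, not to the internal
# request mod_dir makes for DirectoryIndex (which would otherwise loop).
RewriteCond %{THE_REQUEST} \s/(?:[^?\s]*/)?index\.php[?\s]
# /category/index.php -> /category/ and /index.php -> /
RewriteRule ^(.*/)?index\.php$ /$1 [R=301,L]
```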
I would say that if you want to include the default page and force your users to type the page name in the url, make the page name meaningful to a search engine and to your visitors.
Example:
http://mywebsite.org/teaching-web-scripting.php
is far more descriptive and beneficial for SEO rankings than just
http://mywebsite.org/index.php
Might want to take a look at robots.txt files? Not quite the best solution, but you should be able to implement something workable with them...
