Self referencing canonical tag in htaccess? - .htaccess

RewriteCond %{REQUEST_URI} !^/assets/pub/pdf-docs/.*$ [NC]
I need all the pdf files in the assets/pub/pdf-docs folder to have a self referencing canonical https header tag.
How can I do this with one(ish) line(s) of code in the htaccess file?
I cannot apply it to just pdf files because the pdfs in assets/pvt/pdf-docs are excluded from indexing.
Many thanks

You could do it like this:
RewriteEngine On
# Set env var CANONICAL_URL
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^assets/pub/pdf-docs/.+\.pdf$ - [E=CANONICAL_URL:https://%{SERVER_NAME}%{REQUEST_URI}]
Header add Link '<%{CANONICAL_URL}e>; rel="canonical"' env=CANONICAL_URL
The mod_rewrite directives set an environment variable CANONICAL_URL if an existing .pdf file in the stated URL-path is requested. The Header directive then sets a rel="canonical" Link header, using this env var (ie. %{CANONICAL_URL}e), but only if this env var is set.
Retrieving the canonical hostname this way depends on either the hostname already being canonicalised (ie. www vs non-www etc.) prior to these directives, OR UseCanonicalName On and ServerName being set appropriately in the server config (otherwise SERVER_NAME is simply the same as HTTP_HOST - the value of the HTTP Host header). If neither is the case then hardcode the canonical hostname in place of %{SERVER_NAME}.
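As a sketch of the hardcoded variant just mentioned: the directives are the same as above, except that the hostname is written out literally. Here www.example.com is only a placeholder for whatever your canonical hostname is.

```apache
RewriteEngine On

# Assumption: "www.example.com" is a placeholder for the site's canonical hostname
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^assets/pub/pdf-docs/.+\.pdf$ - [E=CANONICAL_URL:https://www.example.com%{REQUEST_URI}]

# Only sent when the env var was set by the rule above
Header add Link '<%{CANONICAL_URL}e>; rel="canonical"' env=CANONICAL_URL
```

Since REQUEST_URI does not include the query string, the canonical URL produced this way is the bare URL-path of the PDF.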
Reference:
https://httpd.apache.org/docs/current/mod/mod_headers.html#header
https://httpd.apache.org/docs/current/mod/mod_rewrite.html#rewriterule
The issue is that Google is choosing old PDFs that return a 404 as canonical and ignoring the correct PDFs. The correct PDFs are in the sitemap.
HOWEVER, as I stated in comments, setting this self-referential canonical tag on the new PDF is not going to help you - it's not going to prevent the old PDFs (that return a 404) from appearing in the search results.
For that, you need to 301 (permanent) redirect the old PDFs to the new to inform search engines that the old PDFs have moved to a new URL (and to redirect users from the search engine's search results)*1. A separate sitemap containing only the "old" PDFs (that now redirect) can also help search engines with crawling the old URLs and discovering the redirect. Adding this sitemap to GSC will give you an idea of the index status of these old (and out of date) URLs.
*1 This assumes that the PDFs have simply changed URL and are not entirely new and unrelated documents. If they are new and unrelated, a redirect would not be appropriate and you should instead serve a "410 Gone" and request that these old URLs be removed from the SERPs using Google's URL removal tool to expedite the removal process.
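To illustrate the two cases, something like the following could go in the root .htaccess file. The file names (old-name.pdf, new-name.pdf, withdrawn.pdf) are made up for the example, since the real old and new URLs are not stated in the question.

```apache
RewriteEngine On

# Hypothetical file names - substitute the real old and new PDF URLs

# Case 1: the PDF has simply moved, so 301 redirect the old URL to the new one
RewriteRule ^assets/pub/pdf-docs/old-name\.pdf$ /assets/pub/pdf-docs/new-name.pdf [R=301,L]

# Case 2: the old PDF has no replacement, so serve "410 Gone"
RewriteRule ^assets/pub/pdf-docs/withdrawn\.pdf$ - [G]
```

It is usually worth testing with a 302 (temporary) redirect first, since a 301 is cached persistently by browsers.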
(Adding a self-referential-canonical tag is only going to help if the same PDF is accessible from different URLs - but this is irrelevant to your current issue. And that isn't something you could necessarily do in .htaccess, unless you do it one-by-one for each PDF, or there is a discernible pattern that allows you to generate the canonical URL regardless of the request.)

Related

URL rewrite with HTACCESS for multiple language support

I am looking to add multiple language support to my website. Is it possible to use the .htaccess file to change something like:
example.com/dir/?lang=en to example.com/en/dir/
example.com/main/?lang=de to example.com/de/main/
example.com/main/page.php?lang=de to example.com/de/main/page.php
Where this works with any possible directories - so if for instance later on I made a new directory, I wouldn't need to add to this. In the above example, I want the latter to be what the user types in/is in the address bar, and the start to be how it is used internally.
Is it possible to use the .htaccess file to change something like:
example.com/dir/?lang=en to example.com/en/dir/
Yes, except that you don't change the URL to example.com/en/dir/ in .htaccess. You change the URL to example.com/en/dir/ in your internal links in your application, before you change anything in .htaccess. This is the canonical URL and is "what the user types in/is in the address bar" - as you say.
You then use .htaccess to internally rewrite the request from example.com/en/dir/, back into the URL your application understands, ie. example.com/dir/?lang=en (or rather example.com/dir/index.php?lang=en - see below). This is entirely hidden from the user. The user only ever sees example.com/en/dir/ - even when they look at the HTML source.
So, we need to rewrite /<lang>/<url-path>/ to /<url-path>/?lang=<lang>. Where <lang> is assumed to be a 2 character lowercase language code. If you are offering only a small selection of languages then this should be explicitly stated to avoid conflicts. We can also handle any additional query string on the original request (if this is required). eg. /<lang>/<url-path>/?<query-string> to /<url-path>/?lang=<lang>&<query-string>.
A slight complication here is that a URL of the form /dir/?lang=en is not strictly a valid endpoint and requires further rewriting. I expect you are relying on mod_dir to issue an internal subrequest for the DirectoryIndex, eg. index.php? So, really, this should be rewritten directly to /dir/index.php?lang=en - or whatever the DirectoryIndex document is defined as.
For example, in your root .htaccess file:
RewriteEngine On
# Rewrite "/<lang>/<directory>/" to "/<directory>/index.php?lang=<lang>"
RewriteCond %{DOCUMENT_ROOT}/$2/index.php -f
RewriteRule ^([a-z]{2})/(.*?)/?$ $2/index.php?lang=$1 [L]
# Rewrite "/<lang>/<directory>/<file>" to "/<directory>/<file>?lang=<lang>"
RewriteCond %{DOCUMENT_ROOT}/$2 -f
RewriteRule ^([a-z]{2})/(.+) $2?lang=$1 [L]
If you have just two languages (as in your example), or a small subset of known languages then change the ([a-z]{2}) subpattern to use alternation and explicitly identify each language code, eg. (en|de|ab|cd).
This does assume you don't have physical directories in the document root that consist of 2 lowercase letters (or match the specific language codes).
Only URLs where the destination directory (that contains index.php) or file exists are rewritten.
This will also rewrite requests for the document root (not explicitly stated in your examples). eg. example.com/en/ (trailing slash required here) is rewritten to /index.php?lang=en.
The regex could be made slightly more efficient if requests for directories always contain a trailing slash. In the above I've assumed the trailing slash is optional, although this does potentially create a duplicate content issue unless you resolve this in some other way (eg. rel="canonical" link element). So, in the code above, both example.com/en/dir/ (trailing slash) and example.com/en/dir (no trailing slash) are both accessible and both return the same resource, ie. /dir/index.php?lang=en.
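A variant that puts two of the points above together might look like this. It assumes only en and de are offered (as in the question's examples), and it adds the QSA flag, which is my addition and not part of the code above, so that any query string on the original request is appended after lang=<lang>.

```apache
RewriteEngine On

# Assumption: only "en" and "de" are valid language codes
# QSA (added here, not in the rules above) appends the original query string

# "/<lang>/<directory>/" to "/<directory>/index.php?lang=<lang>&<query-string>"
RewriteCond %{DOCUMENT_ROOT}/$2/index.php -f
RewriteRule ^(en|de)/(.*?)/?$ $2/index.php?lang=$1 [QSA,L]

# "/<lang>/<directory>/<file>" to "/<directory>/<file>?lang=<lang>&<query-string>"
RewriteCond %{DOCUMENT_ROOT}/$2 -f
RewriteRule ^(en|de)/(.+) $2?lang=$1 [QSA,L]
```

With this in place, a request for example.com/de/main/page.php?foo=1 would be rewritten to /main/page.php?lang=de&foo=1, provided that file exists.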

How to ignore/redirect all URLs matching a certain string

I am using the Wordpress plugin, Timely All-in-One events calendar. Unfortunately it is creating a plethora of duplicate URLs which end in strings like (https://www.mywebsite.com/events/action~agenda/page_offset~-2/request_format~json/cat_ids~4) or (https://www.mywebsite.com/events/action~oneday/exact_date~2-4-2019/) for example.
Because each of these URLs is for a different calendar view but contains the same webpage title and content, some search engines are seeing this as duplicate content. Whilst robots.txt is set up to tell bots to ignore the URLs containing said strings, some crawlers are ignoring robots.txt. I have also disabled the various calendar views so there is now only the agenda view, but in spite of this bots continue to crawl these URLs.
Therefore, is it possible to use an Apache .htaccess directive to tell the server to handle any request containing "/action~" by either removing the string from the URL, so the browser just reads "/events/", or redirecting the URL to another page?
There are over 500 of these URLs so I ideally would like a quick remedy!
Thanks in advance.
Try this rewrite in your .htaccess file:
RewriteEngine On
RewriteRule ^events\/action(.*)$ /events/ [L,R=301]
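A slightly tighter sketch of the same idea is below. It matches only URL-paths that start with events/action~ (including the tilde), so a hypothetical page such as /events/actions/ would be left alone. It also assumes the usual WordPress front-controller block is present in the same file, in which case the redirect needs to come before it.

```apache
RewriteEngine On

# Redirect /events/action~<anything> to /events/
# Assumption: this is placed ABOVE the "# BEGIN WordPress" block,
# otherwise WordPress handles the request before this rule is reached
RewriteRule ^events/action~ /events/ [R=301,L]
```

The slashes do not need to be backslash-escaped in a mod_rewrite pattern, which is why they are written plainly here.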

htaccess How to redirect old dynamic url to the new one

My website uses dynamic links that are set by code in htaccess (below):
RewriteRule ^(.*),(.*),([a-z0-9-_.]+),([a-z0-9-_.]+),([a-z0-9-_.]+)$ $4.php?n=$1&z=$2&t=$3&v=$5 [L,NC,NS,NE]
In effect links looks like this (example):
www.mypage.com/$1,$2,$3,$4,$5
I want to redirect the dynamic links in htaccess from the old structure to a new one, which will look like this (without the $5 parameter):
www.mypage.com/$4/$1-$2/$3
The redirection is necessary, especially to redirect old links available in search engines to the new ones.
Thanks for the help.
I don't have an Apache instance to hand to test against just now but off the top of my head something like this should work:
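# Leading slash makes the redirect target root-relative (a relative target would need RewriteBase)
RewriteRule ^(.*),(.*),([a-z0-9-_.]+),([a-z0-9-_.]+),([a-z0-9-_.]+)$ /$4/$1-$2/$3 [L,R=301]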
RewriteRule ^([a-z0-9-_.]+)/(.*)-(.*)/([a-z0-9-_.]+)$ $1.php?n=$2&z=$3&t=$4 [L,NC,NS,NE]
The first rewrite redirects your old URLs (1,2,3,4,5) to the new ones (4/1-2/3) using a 301 to tell search engines to drop the old URLs in favour of the new ones.
The second rewrite takes the new format and maps it to your actual script.
Note how the 5th param is dropped when transforming old to new.
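As a worked trace of the two rules, here they are again with a made-up URL followed through in the comments. The URL is purely illustrative, and the root-relative redirect target (leading slash) assumes the .htaccess file sits in the document root.

```apache
RewriteEngine On

# Hypothetical old URL:  /red,large,shirt,product,7
#   $1=red  $2=large  $3=shirt  $4=product  $5=7
# 1) External 301 redirect to /product/red-large/shirt  ($5 is discarded)
RewriteRule ^(.*),(.*),([a-z0-9-_.]+),([a-z0-9-_.]+),([a-z0-9-_.]+)$ /$4/$1-$2/$3 [L,R=301]

# 2) The follow-up request /product/red-large/shirt is rewritten internally to
#    product.php?n=red&z=large&t=shirt  (no "v" parameter is passed any more)
RewriteRule ^([a-z0-9-_.]+)/(.*)-(.*)/([a-z0-9-_.]+)$ $1.php?n=$2&z=$3&t=$4 [L,NC,NS,NE]
```

One thing to watch: because (.*)-(.*) is greedy, a new-style URL whose first two values contain hyphens themselves is split at the last hyphen, which may not be the split you intended.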

.htaccess rewrite rule for removing numerical id in the path

I have an over 10 yr old website with lots of external links.
The URL format is like top-level/show/12345/text-name.
I'm in the process of upgrading the system, and the new system supports more user-friendly URLs without the 12345 numerical id, like top-level/show/text-name.
I'm planning on migrating existing contents preserving the text-name.
How do I specify in .htaccess to remove the /12345 level in the path?
12345 can be any number, thousands.
Update any old URLs in your site's own links, unless they have already been changed automatically. Make sure that mod_rewrite is enabled and allowed to work (FollowSymLinks must be allowed). Then add the following rule:
RewriteEngine on
RewriteRule ^show/[0-9]+/([^/]+)/?$ show/$1 [R,L]
Change the R flag to R=301 once you have tested that the redirect works as expected. Changing this will mark this as a permanent redirect. You also might want to remind any users that they should update their bookmarks.
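The rule above matches a URL-path that starts with show/, so it fits an .htaccess file located inside the top-level directory. If the .htaccess file is in the document root instead, a sketch along these lines could be used. It assumes "top-level" stands for a single path segment, and it uses a root-relative target so the redirect does not depend on RewriteBase.

```apache
RewriteEngine On

# Assumption: this .htaccess file is in the document root and
# "top-level" is a single path segment, captured as $1
# /<top-level>/show/<numeric-id>/<text-name> -> /<top-level>/show/<text-name>
RewriteRule ^([^/]+)/show/[0-9]+/([^/]+)/?$ /$1/show/$2 [R=302,L]
```

As with the rule above, switch R=302 to R=301 only once the redirect has been tested.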

links including comma and hash in htaccess

I'm using links like that:
find.php?id=1&n=2#loc
I want my links to look like:
but I don't know how to change the htaccess, or when/where to use # to link to a place in the page (#loc)
htaccess:
RewriteRule ^(.*),(.*),(.*)$ $3.php?id=$2&n=$1 [L,NC,NS,NE]
Help please
The #loc part of the URL is never sent to the server, so there's no way for you to match against it in your htaccess file. Everything from the # onwards in the URL is called the fragment. It is used only by the browser; the server doesn't even know it's there.
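In practice this means the rule from the question can stay as it is, and the fragment is simply appended to the link in the HTML. The link format 2,1,find below is inferred from the rule's pattern, not something stated in the question, so treat it as an assumption.

```apache
RewriteEngine On

# Link in the HTML:   <a href="2,1,find#loc">
# Browser requests:   /2,1,find          (the "#loc" fragment stays in the browser)
# Internal rewrite:   find.php?id=1&n=2
RewriteRule ^(.*),(.*),(.*)$ $3.php?id=$2&n=$1 [L,NC,NS,NE]
```

Once the response arrives, the browser itself scrolls to the element in the page whose id is loc.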
