mod_rewrite: Check if query contains soft hyphen and remove it - .htaccess

In my http logs I see:
"GET /category/f%C2%ADile-to-download/ HTTP/1.1" 301
instead of "GET /category/file-to-download/ HTTP/1.1" 200
I discovered that %C2%AD is a soft hyphen (invisible symbol).
I need to check whether a request to Apache contains a soft hyphen and, if it does, remove it. Any suggestions on the best way to locate the soft hyphen and remove it?
I made some tests with RewriteRule, but got stuck.
Thanks!

As I understand it, mod_rewrite uses un-escaped characters, so in order for you to correctly match the soft-hyphen and then remove it, you would need to edit and save your .htaccess file in UTF-8 encoding (most modern editors will do this).
You will then need to enter a soft-hyphen into your rule. The following will (should!?) remove a single soft-hyphen from your input, but as mentioned it relies on the file being in UTF-8 format:
RewriteRule ([^-]*)-([^-]*) $1$2
Note that you would need to replace the - with an actual soft-hyphen character (U+00AD), saved as UTF-8.
Perhaps an easier option would be this:
RewriteRule ([^\xc2\xad]*)\xc2\xad([^\xc2\xad]*) $1$2 [N]
It uses the specific UTF-8 code you're seeing to remove it from the string. The [N] should rerun all the rewrite rules, which will remove any remaining soft-hyphens.
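To see which bytes the pattern has to match, the percent-decoding can be reproduced outside Apache. The sketch below uses Python purely as an illustration (it is not part of the .htaccess solution). Python's re module is not PCRE, but it should treat this byte-level pattern the same way. The helper name strip_soft_hyphens is made up for this example.

```python
import re
from urllib.parse import unquote_to_bytes

# U+00AD (soft hyphen) encoded as UTF-8 is the two bytes C2 AD.
SOFT_HYPHEN = "\u00ad".encode("utf-8")

def strip_soft_hyphens(raw_path: str) -> bytes:
    """Hypothetical helper: percent-decode a path, then drop every soft hyphen."""
    decoded = unquote_to_bytes(raw_path)  # roughly what the server matches against
    return re.sub(rb"\xc2\xad", b"", decoded)

print(SOFT_HYPHEN)
print(strip_soft_hyphens("/category/f%C2%ADile-to-download/"))
```

The same \xc2\xad escape sequence is what the RewriteRule pattern above relies on.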

Thanks @icabod
I currently have this rule working for my case:
RewriteCond %{REQUEST_URI} \xc2\xad [NC]
RewriteRule ([^\xc2\xad]*)[\xc2\xad]+([^\xc2\xad]*) /$1$2 [N,R=301,L,NC]
.htaccess should be in UTF-8 format as mentioned above.
R=301 - redirect with HTTP code 301
NC - case insensitive
But it doesn't work when there are two soft hyphens in different places in the URL, like this:
/category/f%C2%ADile-to-d%C2%ADownload/
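One possible explanation (an assumption based on how RewriteRule substitution works, not something verified on the server in question): the pattern is not anchored, and the substitution replaces the whole URL-path, not only the part that matched. The second group stops at the first byte of the next soft hyphen, so everything after it never makes it into /$1$2. The Python sketch below imitates this on the decoded bytes. The function names are hypothetical, and the anchored pattern is a suggested variant, not the rule from the post.

```python
import re

# Decoded URL-path as seen in .htaccess (no leading slash).
path = b"category/f\xc2\xadile-to-d\xc2\xadownload/"

def apply_unanchored(p: bytes) -> bytes:
    # Same pattern as the rule above; the result is built only from $1 and $2.
    m = re.search(rb"([^\xc2\xad]*)[\xc2\xad]+([^\xc2\xad]*)", p)
    return b"/" + m.group(1) + m.group(2) if m else b"/" + p

def apply_anchored(p: bytes) -> bytes:
    # Anchored variant: $2 captures the whole remainder, so nothing is lost.
    m = re.match(rb"^(.*?)[\xc2\xad]+(.*)$", p)
    return b"/" + m.group(1) + m.group(2) if m else b"/" + p

print(apply_unanchored(path))                    # tail after the second soft hyphen is gone
first_pass = apply_anchored(path)
print(first_pass)                                # one soft hyphen removed, tail kept
print(apply_anchored(first_pass.lstrip(b"/")))   # second pass removes the other one
```

With the anchored form, each pass (each redirect, in the R=301 case) removes one run of soft hyphens until none are left.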

Related

How to redirect double URL to single URL with htaccess

Google Search Console is showing 404 Page Not Found error for
https://example.com/page/https://example.com/page/
and the link is coming from an external website.
I want to redirect with .htaccess:
https://example.com/page/https://example.com/page/
to
https://example.com/page/
Can anyone help me with this?
Try the following mod_rewrite directives at the top of your .htaccess file:
RewriteEngine On
RewriteRule ^(.*?)https?:/ /$1 [R=301,L]
This just removes any trailing part on the URL-path that starts http:/ (or https:/).
UPDATE: The ? in the capturing subpattern (.*?) makes it non-greedy, so it captures only up to the first occurrence of https:/ and discards the rest. A greedy (.*) would capture up to the last occurrence instead, so a URL containing several repeats would need several successive redirects before every occurrence of https:/ was removed.
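The difference between the two quantifiers can be checked with a small Python sketch. This is only an illustration of the regex behaviour, not of Apache itself, and the sample path with two repeated URLs is made up for the example.

```python
import re

# URL-path as the RewriteRule pattern would see it in .htaccess:
# no leading slash, and "//" already collapsed to "/".
path = "page/https:/example.com/page/https:/example.com/page/"

lazy = re.match(r"^(.*?)https?:/", path)
greedy = re.match(r"^(.*)https?:/", path)

print(lazy.group(1))    # stops before the first "https:/"
print(greedy.group(1))  # runs up to the last "https:/"
```

With the non-greedy pattern, $1 is already the clean path after a single redirect.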
Additional notes:
First test with 302 (temporary) redirect to make sure it works. Only change to 301 when confirmed, to avoid caching issues.
The URL-path that is matched by the RewriteRule pattern has already had sequences of slashes reduced to single slashes, so you can't match // (double slash) here (but I don't think you need to).
If there are query strings involved then you may need a slightly different approach and another directive, since the query string itself (as opposed to the URL-path) might contain the "repeated URL" that needs to be removed (we would need to see an example first). The RewriteRule pattern matches against the URL-path only, not the query string.
On Windows: If the (scheme and) colon (:) appears in the first path segment (ie. the malformed link is for the document root) then Apache will generate a 403 Forbidden before .htaccess is able to redirect. There is nothing you can do to avoid this since it is a limitation of the OS (colons are not allowed in filesystem paths - the 403 occurs when Apache tries to map the URL to a filesystem path). This does not happen on Linux. For example: https://example.com/https://example.com/.
UPDATE: If you are not seeing a redirect, just a 404 then you may need to enable additional pathname information (PATH_INFO) on your URLs. For example, at the top of your .htaccess file:
AcceptPathInfo On

How do I write an .htaccess redirect for a directory containing an ellipsis

Somehow an invalid directory got indexed in Google, and because of some dynamic relative links I now have 2500 "missing" pages indexed. I'm trying to use an .htaccess 301 redirect to correct the problem, but I can't seem to get it to work. I need to redirect www.domain.com/shop/pc/.../pc/filename.asp to www.domain.com/shop/pc/filename.asp.
The rule I have written that doesn't want to work is RewriteRule ^shop/pc/\.\.\./pc/(.*)$ /shop/pc/$1 [R=301,L]
Any thoughts?
mod_rewrite uses PCRE, so for these Unicode characters you can match their UTF-8 byte sequences with \x escapes (I included the two dot leader as well, since I imagine that is more likely to sneak into a URL than an ellipsis):
# U+2026 … \xe2\x80\xa6 HORIZONTAL ELLIPSIS
RewriteRule ^shop/pc/\xe2\x80\xa6/pc/(.*)$ /shop/pc/$1 [R=301,L]
# U+2025 ‥ \xe2\x80\xa5 TWO DOT LEADER
RewriteRule ^shop/pc/\xe2\x80\xa5/pc/(.*)$ /shop/pc/$1 [R=301,L]
Note you may need the [B] flag (see flags) if the browser is percent-escaping the ellipsis.
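The \x sequences in the rules above are simply the UTF-8 bytes of each character. A short Python sketch for deriving them for any character (the function name pcre_escape is hypothetical):

```python
def pcre_escape(ch: str) -> str:
    """Return the UTF-8 bytes of ch written as \\xNN escapes."""
    return "".join(f"\\x{b:02x}" for b in ch.encode("utf-8"))

for ch in ("\u2026", "\u2025"):
    print(f"U+{ord(ch):04X} -> {pcre_escape(ch)}")
```

The same helper can be used to work out the escapes for other characters that slip into URLs.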

Using .htaccess to style URL directory style

I have searched this question and looked around but can't seem to get this working in practice. This is my .htaccess file:
Options +FollowSymLinks
RewriteEngine on
RewriteRule /poker/(.*)/(.*)/$ /poker/?$1=$2
I am trying to get my page to work like this:
mysite.com/poker/page/home
But this just isn't working. I have used 3 different generators and tried typing it manually from tutorials, but it just returns a 404. Any ideas are greatly appreciated; it could be something really obvious.
Thanks
You do not have a trailing slash in your example, yet your rule requires one. You can make the trailing slash optional:
RewriteEngine on
RewriteRule /poker/(.*)/(.*)/?$ /poker/?$1=$2
Note, however, that a URI such as /poker/a/b/c/d/e/f/g also matches here: a/b/c/d/e/f is captured by the first subpattern and g by the second, because (.*) is greedy. (If the URI ends with a trailing slash, the first subpattern captures a/b/c/d/e/f/g and the second is empty.) Be more specific if you wish to match only the content between slashes, e.g. ([^/]*).
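What each group captures can be checked with a quick Python sketch. It only illustrates the regex behaviour: the patterns are written without the leading slash, as they would be matched in .htaccess, and the [^/]+ variant is one possible way of being more specific.

```python
import re

greedy = re.compile(r"^poker/(.*)/(.*)/?$")
strict = re.compile(r"^poker/([^/]+)/([^/]+)/?$")

# Greedy groups: the first group takes as many path segments as it can.
print(greedy.match("poker/a/b/c/d/e/f/g").groups())
print(greedy.match("poker/a/b/c/d/e/f/g/").groups())

# Segment-only groups: exactly two segments, optional trailing slash.
print(strict.match("poker/page/home").groups())
print(strict.match("poker/a/b/c/d/e/f/g"))  # no match
```

With the stricter pattern, deeper paths fall through to whatever rules (or 404 handling) come next.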
Well, there's really nothing wrong with the rules that you have if http://mysite.com/poker/?page=home resolves correctly. The only thing is that if this is in an htaccess file, the leading slash is removed from the URI before it's matched against a RewriteRule pattern, so you need to remove it from your regular expression (or make it optional):
RewriteRule ^poker/(.+)/(.+)/?$ /poker/?$1=$2
And maybe make the groupings (.+) instead so that there is at least one character there.

Mod_Rewrite to /subdirectory and /subdirectory/query

I'm having a difficult time getting into using mod_rewrite. I've been at this for about an hour googling stuff but nothing quite seems to work. What I want to do is change
example.com/species.php into example.com/species
and also
example.com/species.php?name=frog into example.com/species/frog.
Using
Options +FollowSymlinks
RewriteEngine on
RewriteRule ^species/(.*)$ /species.php?name=$1
I can get example.com/species.php?name=frog to display as example.com/species/frog, and with
RewriteRule ^species/ /species.php
I can get example.com/species.php to display as example.com/species/, but I can't get both of them to work at the same time.
Also, example.com/species with no trailing slash always comes up as a 404.
I've considered just making a /species/ directory to catch any problems but I'd rather just have a few rules for one species.php file. Any help would gladly be appreciated!
Edit (because I can't answer my own question for 8 more hours):
I seem to have fixed both of my problems. I changed my .htaccess to:
Options +FollowSymlinks
RewriteEngine on
RewriteRule ^species/(.*)$ /species.php?name=$1
RewriteRule ^species/?$ /species.php
The second RewriteRule successfully rewrites example.com/species to example.com/species.php while leaving the first RewriteRule working at the same time.
However, if I typed in example.com/species/ with a trailing slash, it was being read as example.com/species.php?name= and would throw an error because no name was submitted, so I just added
if (isset($_GET['name']) && empty($_GET['name'])) { header('Location: http://example.com/species'); exit; }
so that if I used example.com/species/ it would redirect to /species and work as desired.
If you change the * (match zero or more) to a + (match one or more) in your first RewriteRule then you should stop seeing species.php?name= if a trailing slash is used.
This is because the + will require that something appears after the slash, otherwise the rule will not match. Then your second RewriteRule will match because it ends with an optional slash, but will not add the name= query string to the target URL.
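The effect of swapping * for + can be sketched in Python. This is only a rough imitation of the two rules: the function name rewrite is hypothetical, and it stops at the first matching rule, which is an assumption about how the rules are meant to interact.

```python
import re

def rewrite(path: str) -> str:
    """Imitate the two rules, with + in the first one."""
    m = re.match(r"^species/(.+)$", path)
    if m:
        return f"/species.php?name={m.group(1)}"
    if re.match(r"^species/?$", path):
        return "/species.php"
    return "/" + path  # no rule matched

# With *, "species/" matches the first rule and captures an empty name.
star_group = re.match(r"^species/(.*)$", "species/").group(1)
print(repr(star_group))

for p in ("species", "species/", "species/frog"):
    print(p, "->", rewrite(p))
```

With + in place of *, the trailing-slash request no longer produces an empty name= parameter.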
You may also want to add the [L] flag (last) after the first rule, because you don't need the second rule to execute if the first rule matches. (Note that this will not stop the RewriteCond and RewriteRule tests being run on the resulting redirect URL, which will have to go through the .htaccess file just like any other request.)
See the Reference Documentation for mod_rewrite in Apache 2.4 (or see the docs for the version of Apache you're actually using).

How to redirect a URL containing smartquotes via .htaccess?

Is there a way to redirect a URL containing smart quotes via .htaccess? I'm using the following rules. Only the last one seems to work:
RewriteRule ^8-%E2%80%9Crules%E2%80%9D-for-social-advertising$ /8-rules-for-social-advertising [R=301,L]
RewriteRule ^8-“rules”-for-social-advertising$ /8-rules-for-social-advertising [R=301,L]
RewriteRule ^8-%25E2%2580%259Crules%25E2%2580%259D-for-social-advertising$ /8-rules-for-social-advertising [R=301,L]
When I surf to http://blog.eloqua.com/8-“rules”-for-social-advertising/ or http://blog.eloqua.com/8-%E2%80%9Crules%E2%80%9D-for-social-advertising it doesn't get redirected.
But if I go to http://blog.eloqua.com/8-%25E2%2580%259Crules%25E2%2580%259D-for-social-advertising everything works just fine.
What am I doing wrong? Thanks so much for your help!
You are right, it's slipping past the rules you have provided.
The reason is that Apache percent-decodes the URL-path before mod_rewrite sees it, so %E2%80%9C and %E2%80%9D (the curly "smart quotes") have already been turned into their raw UTF-8 bytes. As such, your pattern needs to match the byte sequences that represent those characters.
In order to properly redirect urls such as this:
http://www.example.com/8-%E2%80%9Crules%E2%80%9D-for-social-advertising
You would use a rule like this:
RewriteRule ^8-\xE2\x80\x9Crules\xE2\x80\x9D-for-social-advertising/?$ /8-rules-for-social-advertising [R=301,L]
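The decoding step described above can be reproduced in Python to see what the pattern is matched against. This is a sketch of the byte-level behaviour only, not of Apache, and the variable names are made up for the example.

```python
import re
from urllib.parse import unquote_to_bytes

# Pattern written with \x escapes for the UTF-8 bytes of the smart quotes.
pattern = re.compile(rb"^8-\xE2\x80\x9Crules\xE2\x80\x9D-for-social-advertising/?$")

# Decode each request path once, as the server does before matching.
single = unquote_to_bytes("8-%E2%80%9Crules%E2%80%9D-for-social-advertising")
double = unquote_to_bytes("8-%25E2%2580%259Crules%25E2%2580%259D-for-social-advertising")

print(single, bool(pattern.match(single)))  # raw smart-quote bytes
print(double, bool(pattern.match(double)))  # still contains literal "%E2..." text
```

A singly encoded URL decodes to the raw bytes that the \x pattern matches, whereas a doubly encoded one decodes to literal %E2%80%9C text and is a different URL as far as the pattern is concerned.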

Resources