encoding with .htaccess

encoding with .htaccess - .htaccess

To access a page in my website I use this rule
^/map/(.+)\-([0-9]{1,5})/map-.+\-street\-([a-z])\.html$
The link I use is www.mydomain.com/map/neus%e4%df-86/map-neus%e4%df-street-a.html.
When I'll chekc the IIS7 rewrite log the link is transformed to /map/neusä߆/map-neusäß-street-a.html, and of course I got a 404 error.
So, my question is, why is neus%e4%df-86 transormed to neusä߆
instead of neustäß
(P.S. the page-s encoding is charset=iso-8859-1)

So, I found the solution. The [NE] flag was not set at the end of the rule. Related to the Helicon's documentation, "NE: Don't escape output. By default ISAPI_Rewrite will encode all non-ANSI characters as %xx hex codes in output." This way the link remains unencoded.

Related

Encoded URL in htaccess 301 redirect (mod_rewrite)

I am lost after digging into this matter for a lot of days. We have the following redirects:
RewriteRule ^something/something2/?$ http://test.com/blabla?key=blablabla1287963%3D [R=301,L,E=OUTLINK:1]
Header always set X-Robots-Tag "noindex" env=OUTLINK
Unfortunately that %3D got stripped by the module (mod_rewrite). The main problem is that I know how to manually fix it but I have multiple similar redirects and I need a "global solution". Please note that moving back to redirect 301 (I had no issues with redirect 301 and encoded URLs/characters) is not an option since I want to use noindex...
Thank you!

that %3D got stripped by the module
I think you'll find that it's the %3 that gets stripped, not %3D. %3 is seen as a backreference to a preceding condition - which I suspect doesn't exist - so gets replaced with an empty string in the substitution. (This would not have been a problem with Redirect since %N backreferences aren't a thing with mod_alias.)
You need to backslash escape the % to represent a literal % in the substitution string in order to negate its special meaning in this case.
You will then need the NE flag on the RewriteRule to prevent the % itself from being URL encoded (as %25) in the response (essentially doubly encoding the URL param value).
For example:
RewriteRule ^foo$ http://test.com/blabla?key=blablabla1287961\%3D [NE,R=302,L,E=OUTLINK:1]
I have multiple similar redirects and I need a "global solution"
As far as a "global solution" goes, there isn't a magic switch you can enable on the server that will "fix" this. You need to modify each directive where this conflict occurs.

What is the smallest URL friendly encoding?

I am looking for a url encoding method that is most efficient in terms of space. Raw binary (base2) could be represented in base16 which is smaller and is url safe, but base64 is even more efficient. However, the usual base64 encoding isn't url safe....
So what is the smallest encoding method that is also safe for URLS?

This is what the Base64 URL encoding variant is for.
It uses the same standard Base64 Alphabet except that + is changed to - and / is changed to _.
Most modern Base64 implementations will support this alternate encoding. If yours doesn't, it's usually just a matter of doing a search/replace on the Base64 input prior to decoding, or on the output prior to sending it to a browser.

You can use a 62 character representation instead of the usual base 64. This will give you URLs like the youtube ones:
http://www.youtube.com/watch?v=0JD55e5h5JM
You can use the PHP functions provided in this page if you need to map strings to a database numerical ID:
http://bsd-noobz.com/blog/how-to-create-url-shortening-service-using-simple-php
Or this one if you need to directly convert a numerical ID to a short URL string:
http://kevin.vanzonneveld.net/techblog/article/create_short_ids_with_php_like_youtube_or_tinyurl/

"base66" (theoretical, according to spec)
As far as I can tell, the optimal encoding for URLs is a "base66" encoding into the following alphabet:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789-_.~
These are all the "Unreserved characters" according the URI specification RFC 3986 (section 2.3), so they will appear as-is in the URL. Using this "base66" encoding could give a URL like:
https://example.org/articles/.3Ja~jkWe
The question is then if you want . and ~ in your URLs?
On some older servers (ancient by now, I guess) ~joe would mean the "www directory" of the user joe on this server. And thus a user might be confused as to what the ~ character is doing in the middle of your URL.
This is common for academic websites, especially CS professors (e.g. Donald Knuth's website https://www-cs-faculty.stanford.edu/~knuth/)
"base80" (in practice, but not battle-tested)
However, in my own testing the following 14 other symbols also do not get
percent-encoded (in Chrome 95 and Firefox 93):
!$'()*+,:;=#[]
(see also this StackOverflow answer)
leaving a "base80" URL encoding possible. Some of these (notably + and =) would not work in the query string part of the URL, only in the path part. All in all, this ends up giving you beautiful, hyper-compressed URLs like:
https://example.org/articles/1OWG,HmpkySCbBy#RG6_,
https://example.org/articles/21Cq-b6Ud)txMEW$,hc4K
https://example.org/articles/:3Tx**U9X'd;tl~rR]q+
There's a plethora of reasons why you might not want all of those symbols in your URLs. One example is that StackOverflow's own "linkifier" won't include that ending comma in the link it generates (I've manually made it a part of the link here).
Also the percent encoding seems to be quite finicky. In some cases Firefox would initially percent-encode ' and ~] but on later requests would not.

URL Rewrite with 3 parameters for forum .htaccess

I am trying to URL Rewrite to the following URL
Http://***.com/index.php?p=forum&mod=view_posts&page=$3&name=$2&id=$1
Http://***.com/forum/{id}-{name}/{page}
Http://***.com/forum/1-Hello-World/1
I have tryed the following code and have had no joy
RewriteRule ^forum/([^-]+)-([^&]+)/([^-]+)$ index.php?p=forum&mod=view_posts&page=$3&orderby=$2&id=$1
Thanks

That regex isn't very good: you see, the "([^&]+)" says: "one or more characters, up until the first ampersand", while you have no ampersands in the subject. Also, the "([^-]+)$" says "one or more characters before a hyphen", while you don't intend to end the subject with a hyphen.
Try this one:
^forum/([^-]+)-([^/]+)/(.+)$
But note that this actually captures any characters in the id and page positions, so you might be better off with
^forum/([0-9]+)-([^/]+)/([0-9]+)$
as that allows only numbers in those positions.
Also, you probably meant "index.php?p=forum&mod=view_posts&page=$3&name=$2&id=$1" instead of "index.php?p=forum&mod=view_posts&page=$3&orderby=$2&id=$1"

urlencoded Forward slash is breaking URL

About the system
I have URLs of this format in my project:-
http://project_name/browse_by_exam/type/tutor_search/keyword/class/new_search/1/search_exam/0/search_subject/0
Where keyword/class pair means search with "class" keyword.
I have a common index.php file which executes for every module in the project. There is only a rewrite rule to remove the index.php from URL:-
RewriteCond $1 !^(index\.php|resources|robots\.txt)
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php [L,QSA]
I am using urlencode() while preparing the search URL and urldecode() while reading the search URL.
Problem
Only the forward slash character is breaking URLs causing 404 page not found error.
For example, if I search one/two the URL is
http://project_name/browse_by_exam/type/tutor_search/keyword/one%2Ftwo/new_search/1/search_exam/0/search_subject/0/page_sort/
How do I fix this? I need to keep index.php hidden in the URL. Otherwise, if that was not needed, there would have been no problem with forward slash and I could have used this URL:-
http://project_name/index.php?browse_by_exam/type/tutor_search/keyword/one
%2Ftwo/new_search/1/search_exam/0/search_subject/0

Apache denies all URLs with %2F in the path part, for security reasons: scripts can't normally (ie. without rewriting) tell the difference between %2F and / due to the PATH_INFO environment variable being automatically URL-decoded (which is stupid, but a long-standing part of the CGI specification so there's nothing can be done about it).
You can turn this feature off using the AllowEncodedSlashes directive, but note that other web servers will still disallow it (with no option to turn that off), and that other characters may also be taboo (eg. %5C), and that %00 in particular will always be blocked by both Apache and IIS. So if your application relied on being able to have %2F or other characters in a path part you'd be limiting your compatibility/deployment options.
I am using urlencode() while preparing the search URL
You should use rawurlencode(), not urlencode() for escaping path parts. urlencode() is misnamed, it is actually for application/x-www-form-urlencoded data such as in the query string or the body of a POST request, and not for other parts of the URL.
The difference is that + doesn't mean space in path parts. rawurlencode() will correctly produce %20 instead, which will work both in form-encoded data and other parts of the URL.

Replace %2F with %252F after url encoding
PHP
function custom_http_build_query($query=array()){
return str_replace('%2F','%252F', http_build_query($query));
}
Handle the request via htaccess
.htaccess
RewriteCond %{REQUEST_URI} ^(.*?)(%252F)(.*?)$ [NC]
RewriteRule . %1/%3 [R=301,L,NE]
Resources
http://www.leakon.com/archives/865

In Apache, AllowEncodedSlashes On would prevent the request from being immediately rejected with a 404.
Just another idea on how to fix this.

$encoded_url = str_replace('%2F', '/', urlencode($url));

I had the same problem with slash in url get param, in my case following php code works:
$value = "hello/world"
$value = str_replace('/', '/', $value;?>
$value = urlencode($value);?>
# $value is now hello%26%2347%3Bworld
I first replace the slash by html entity and then I do the url encoding.

Here's my humble opinion. !!!! Don't !!!! change settings on the server to make your parameters work correctly. This is a time bomb waiting to happen someday when you change servers.
The best way I have found is to just convert the parameter to base 64 encoding. So in my case, I'm calling a php service from Angular and passing a parameter that could contain any value.
So my typescript code in the client looks like this:
private encodeParameter(parm:string){
if (!parm){
return null;
}
return btoa(parm);
}
And to retrieve the parameter in php:
$item_name = $request->getAttribute('item_name');
$item_name = base64_decode($item_name);

On my hosting account this problem was caused by a ModSecurity rule that was set for all accounts automatically. Upon my reporting this problem, their admin quickly removed this rule for my account.

Use a different character and replace the slashes server side
e.g. Drupal.org uses %21 (the excalamation mark character !) to represent the slash in a url parameter.
Both of the links below work:
https://api.drupal.org/api/drupal/includes%21common.inc/7
https://api.drupal.org/api/drupal/includes!common.inc/7
If you're worried that the character may clash with a character in the parameter then use a combination of characters.
So your url would be
http://project_name/browse_by_exam/type/tutor_search/keyword/one_-!two/new_search/1/search_exam/0/search_subject/0
change it out with js and convert it back to a slash server side.

is simple for me use base64_encode
$term = base64_encode($term)
$url = $youurl.'?term='.$term
after you decode the term
$term = base64_decode($['GET']['term'])
this way encode the "/" and "\"

A standard solution for this problem is to allow slashes by making the parameter that may contain slashes the last parameter in the url.
For a product code url you would then have...
mysite.com/product/details/PR12345/22
For a search term you'd have
http://project/search_exam/0/search_subject/0/keyword/Psychology/Management
(The keyword here is Psychology/Management)
It's not a massive amount of work to process the first "named" parameters then concat the remaining ones to be product code or keyword.
Some frameworks have this facility built in to their routing definitions.
This is not applicable to use case involving two parameters that my contain slashes.

I use javascript encodeURI() function for the URL part that has forward slashes that should be seen as characters instead of http address.
Eg:
"/api/activites/" + encodeURI("?categorie=assemblage&nom=Manipulation/Finition")
see http://www.w3schools.com/tags/ref_urlencode.asp

I solved this by using 2 custom functions like so:
function slash_replace($query){
return str_replace('/','_', $query);
}
function slash_unreplace($query){
return str_replace('_','/', $query);
}
So to encode I could call:
rawurlencode(slash_replace($param))
and to decode I could call
slash_unreplace(rawurldecode($param);
Cheers!

You can use %2F if using it this way:
?param1=value1&param2=value%2Fvalue
but if you use /param1=value1/param2=value%2Fvalue it will throw an error.

Problem using unicode in URLs with cgi.PATH_INFO in ColdFusion

My ColdFusion (MX7 on IIS 6) site has search functionality which appends the search term to the URL e.g. http://www.example.com/search.cfm/searchterm.
The problem I'm running into is this is a multilingual site, so the search term may be in another language e.g. القاهرة leading to a search URL such as http://www.example.com/search.cfm/القاهرة
The problem is when I come to retrieve the search term from the URL. I'm using cgi.PATH_INFO to retrieve the path of the search page and the search term and extracting the search term from this e.g. /search.cfm/searchterm however, when unicode characters are used in the search they are converted to question marks e.g. /search.cfm/??????.
These appear actual question marks, rather than the browser not being able to format unicode characters, or them being mangled on output.
I can't find any information about whether ColdFusion supports unicode in the URL, or how I can go about resolving this and getting hold of the complete URL in some way - does anyone have any ideas?
Cheers,
Tom
Edit: Further research has lead me to believe the issue may related to IIS rather than ColdFusion, but my original query still stands.
Further edit
The result of GetPageContext().GetRequest().GetRequestUrl().ToString() is http://www.example.com/search.cfm/searchterm/????? so it appears the issue goes fairly deep.

Yeah, it's not really ColdFusion's fault. It's a common problem.
It's mostly the fault of the original CGI specification, which specifies that PATH_INFO has to be %-decoded, thus losing the original %xx byte sequences that would have allowed you to work out which real characters were meant.
And it's partly IIS's fault, because it always tries to read submitted %xx bytes in the path part as UTF-8-encoded Unicode (unless the path isn't a valid UTF-8 byte sequence in which case it plumps for the Windows default code page, but gives you no way to find out this has happened). Having done so, it puts it in environment variables as a Unicode string (as envvars are Unicode under Windows).
However most byte-based tools using the C stdio (and I'm assuming this applies to ColdFusion, as it does under Perl, Python 2, PHP etc.) then try to read the environment variables as bytes, and the MS C runtime encodes the Unicode contents again using the Windows default code page. So any characters that don't fit in the default code page are lost for good. This would include your Arabic characters when running on a Western Windows install.
A clever script that has direct access to the Win32 GetEnvironmentVariableW API could call that to retrieve a native-Unicode environment variable which they could then encode to UTF-8 or whatever else they wanted, assuming that the input was also UTF-8 (which is what you'd generally want today). However, I don't think CodeFusion gives you this access, and in any case it only works from IIS6 onwards; IIS5.x will throw away any non-default-codepage characters before they even reach the environment variables.
Otherwise, your best bet is URL-rewriting. If a layer above CF can convert that search.cfm/القاهرة to search.cfm/?q=القاهرة then you don't face the same problem, as the QUERY_STRING variable, unlike PATH_INFO, is not specified to be %-decoded, so the %xx bytes remain where a tool at CF's level can see them.

Here's what you could do:
<cfset url.searchTerm = URLEncodedFormat("القاهر", "utf-8") >
<cfset myVar = URLDecode(url.searchTerm , "utf-8") >
Ofcourse, I'd recommend that you work with something like this in that case:
yourtemplate.cfm?searchTerm=%C3%98%C2%A7%C3%99%E2%80%9E
And then you do URL rewriting in IIS (if not already done by framework/rest of the app) http://learn.iis.net/page.aspx/461/creating-rewrite-rules-for-the-url-rewrite-module/ to match your pattern.

You can set the character encoding of the URL and FORM scope using the setEncoding() function:
http://www.adobe.com/livedocs/coldfusion/7/htmldocs/wwhelp/wwhimpl/common/html/wwhelp.htm?context=ColdFusion_Documentation&file=00000623.htm
You need to do this before you access any of the variables in this scope.
But, the default encoding of those scopes is already UTF-8, so this may not help. Also, this would probably not affect the CGI scope.
Is the IIS Server logging the correct characters into the request log?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string