Node.js URL-encoding for pre-RFC3986 URLs (using + vs %20)

Within Node.js, I am using querystring.stringify() to encode an object into a query string for use in a URL. Values that contain spaces are encoded as %20.
I'm working with a particularly finicky web service that will only accept spaces encoded as +, as was commonly done prior to RFC3986.
Is there a way to set an option for querystring so that it encodes spaces as +?
Currently I am simply doing a .replace() to swap all instances of %20 with +, but this feels unnecessary if there is an option I can set ahead of time.
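For reference, a minimal version of what I'm doing now (the wrapper name is just for illustration):

const querystring = require('querystring');

// Hypothetical wrapper: encode normally, then swap %20 for '+' as the
// legacy (RFC 1738-style) service expects.
function stringifyLegacy(obj) {
  return querystring.stringify(obj).replace(/%20/g, '+');
}

console.log(stringifyLegacy({ a: 'b c' })); // 'a=b+c'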

If anyone is still facing this issue, the "qs" npm package has a feature to encode spaces as +:
qs.stringify({ a: 'b c' }, { format: 'RFC1738' })
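With that option, the example produces a=b+c rather than a=b%20c.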

I can't think of any library that does this by default, and unfortunately, I'd say your implementation may be the most efficient way to do this, since any other option would probably either do what you're already doing, or use slower non-compiled pure JavaScript code.
What about asking the web service provider to follow the RFC?

https://github.com/kvz/phpjs is a Node.js package that provides ports of PHP's functions. At the time of writing, its http_build_query implementation only supports urlencode (spaces become + in the query string), but hopefully it will soon include the enc_type parameter / rawurlencode (%20 for spaces).
See http://php.net/http_build_query.
RFC1738 (+'s) will be the default enc_type either way, so you can use it immediately for your purposes.

Node.js JavaScript-stringify

JSON is not a subset of JavaScript. I need my output to be 100% valid JavaScript; it will be evaluated as such -- i.e., JSON.stringify will not (always) work for my needs.
Is there a JavaScript stringifier for Node?
As a bonus, it would be nice if it could stringify objects.
You can use JSON.stringify and afterwards replace the remaining U+2028 and U+2029 characters. As the linked article states, these characters can only occur inside strings, so we can safely replace them with their escaped versions without worrying about replacing characters where we shouldn't:
JSON.stringify('ro\u2028cks').replace(/\u2028/g,'\\u2028').replace(/\u2029/g,'\\u2029')
From the last paragraph in the article you linked:
The solution
Luckily, the solution is simple: If we look at the JSON specification we see that the only place where a U+2028 or U+2029 can occur is in a string. Therefore we can simply replace every U+2028 with \u2028 (the escape sequence) and U+2029 with \u2029 whenever we need to send out some JSONP.
It’s already been fixed in Rack::JSONP and I encourage all frameworks or libraries that send out JSONP to do the same. It’s a one-line patch in most languages and the result is still 100% valid JSON.
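A minimal helper wrapping that approach (the function name is mine):

// Produce a string that is valid JSON and also safe to embed in JavaScript.
function jsStringify(value) {
  return JSON.stringify(value)
    .replace(/\u2028/g, '\\u2028')   // U+2028 LINE SEPARATOR
    .replace(/\u2029/g, '\\u2029');  // U+2029 PARAGRAPH SEPARATOR
}

console.log(jsStringify('ro\u2028cks')); // "ro\u2028cks"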

What's the name for hyphen-separated case?

This is PascalCase: SomeSymbol
This is camelCase: someSymbol
This is snake_case: some_symbol
So my question is whether there is a widely accepted name for this: some-symbol? It's commonly used in URLs.
There isn't really a standard name for this case convention, and there is disagreement over what it should be called.
That said, as of 2019, there is a strong case to be made that kebab-case is winning:
https://trends.google.com/trends/explore?date=all&q=kebab-case,spinal-case,lisp-case,dash-case,caterpillar-case
spinal-case is a distant second, and no other terms have any traction at all.
Additionally, kebab-case has entered the lexicon of several JavaScript code libraries, e.g.:
https://lodash.com/docs/#kebabCase
https://www.npmjs.com/package/kebab-case
https://v2.vuejs.org/v2/guide/components-props.html#Prop-Casing-camelCase-vs-kebab-case
However, there are still other terms that people use. Lisp has used this convention for decades as described in this Wikipedia entry, so some people have described it as lisp-case. Some other forms I've seen include caterpillar-case, dash-case, and hyphen-case, but none of these is standard.
So the answer to your question is: No, there isn't a single widely-accepted name for this case convention analogous to snake_case or camelCase, which are widely-accepted.
It's referred to as kebab-case. See lodash docs.
It's also sometimes known as caterpillar-case.
This is a very common case, and it has many names:
kebab-case: the name most widely adopted by software
caterpillar-case
dash-case
hyphen-case or hyphenated-case
lisp-case
spinal-case
css-case
slug-case
friendly-url-case
As the character (-) is referred to as "hyphen" or "dash", it seems more natural to name this "dash-case" or, less frequently, "hyphen-case".
As mentioned in Wikipedia, "kebab-case" is also used. Apparently (see answer) this is because the character looks like a skewer... It needs some imagination though.
It is used in the lodash library, for example.
Recently, "dash-case" was used by
Angular (https://angular.io/guide/glossary#case-types)
NPM modules
https://www.npmjs.com/package/case-dash (removed ?)
https://www.npmjs.com/package/dasherize
Adding the correct link here: Kebab Case,
which is all lowercase with - separating words.
I've always called it, and heard it be called, 'dashcase.'
There is no standardized name.
Libraries like jQuery and lodash refer to it as kebab-case, as does the Vue.js framework. However, I am not sure whether it's safe to declare that it's universally called kebab-case in the JavaScript world.
I've always known it as kebab-case.
On a funny note, I've heard people call it a SCREAM-KEBAB when all the letters are capitalized.
Kebab Case Warning
I've always liked kebab-case, as it seems the most readable when you need a word separator. However, some programs interpret the dash as a minus sign, and what you think is a name can turn into a subtraction operation:
first-second // first minus second?
ten-2 // ten minus two?
Also, some frameworks parse dashes in kebab-cased names. For example, GitHub Pages uses Jekyll, and Jekyll parses any dashes it finds in an md file: a file named 2020-1-2-homepage.md gets put into a folder structured as \2020\1\2\homepage.html when the site is compiled.
Snake_case vs kebab-case
A safer alternative to kebab-case is snake_case, or SCREAMING_SNAKE_CASE, as underscores cause less confusion when compared to a minus sign.
I'd simply say that it was hyphenated.
Worth mentioning, from vim-abolish:
https://github.com/tpope/vim-abolish/blob/master/doc/abolish.txt#L152
dash-case or kebab-case
In Salesforce, it is referred to as kebab-case. See below:
https://developer.salesforce.com/docs/component-library/documentation/lwc/lwc.js_props_names
Here is a more recent discombobulation: documentation throughout AngularJS, and Pluralsight courses and books on Angular, all refer to kebab-case as snake-case, not differentiating between the two.
It's too bad caterpillar-case did not stick, because snake_case and caterpillar-case are easily remembered and actually look like what they represent (if you have a good imagination).
My ECMAScript proposal for String.prototype.toKebabCase:
String.prototype.toKebabCase = function () {
  return this.valueOf()
    .replace(/-/g, ' ')          // treat existing hyphens as word breaks
    .split('')
    .reduce((str, char) => char.toUpperCase() === char
      ? `${str} ${char}`         // insert a space before each uppercase character
      : `${str}${char}`, '')
    .replace(/ +/g, ' ')         // collapse runs of spaces
    .trim()
    .replace(/ /g, '-')          // spaces become hyphens
    .toLowerCase();
};
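For example, assuming the prototype extension above:
'backgroundColor'.toKebabCase() // 'background-color'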
This casing can also be called a "slug", and the process of turning a phrase into it "slugify".
https://hexdocs.pm/slugify/Slug.html

How is this code protecting against XSS attacks?

I'm using a validation library that removes some common XSS attacks from the input to my web application. It works fine, and I'm also escaping everything I render to protect against XSS attacks.
The library contains this line in part of the XSS filtering process:
// Protect query string variables in URLs => 901119URL5918AMP18930PROTECT8198
str = str.replace(/\&([a-z\_0-9]+)\=([a-z\_0-9]+)/i, xss_hash() + '$1=$2');
xss_hash returns a string of random alpha-numeric characters. Basically it takes a URL with a query string, and mangles it a bit:
> xss('http://example.com?something=123&somethingElse=456&foo=bar')
'http://example.com?something=123eujdfnjsdhsomethingElse=456&foo=bar'
Besides having a bug (it only "protects" one parameter, not all of them), it seems to me the whole thing is itself a bug.
So my question is: what kind of attack vector is this kind of replacement protecting against?
If it's not really doing anything, I would like to submit a patch to the project removing it completely. And if it is legitimately protecting users of the library, I'd like to submit a patch to fix the existing bug.
xss_hash returns a string of random alpha-numeric characters.
Are they definitely random, or is it generated based on computable data?
It appears to be Security through obscurity: it's trying to replace all the &'s with xss_hash()'s so that the query is less readable. I'm guessing there is a part of the library which undoes this (i.e. treats all the xss_hash()'s in the string as &s for parsing purposes).
The code in question "protected query string variables" by replacing the & separating URL parameters with a random string, before doing some other processing that would remove or otherwise mangle ampersands. As Jay Shah pointed out, there was code just below that was meant to restore the query-string ampersands afterwards, but another bug was preventing it from working as intended.
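A sketch of that protect/filter/restore round-trip (the token generation and filter step are placeholders, and the g flag fixes the one-parameter bug noted above):

const crypto = require('crypto');

// Stand-in for the library's xss_hash(): a random token unlikely to
// appear in user input.
const token = crypto.randomBytes(8).toString('hex');

function protectQueryString(str) {
  // g flag so every parameter separator is protected, not just the first
  return str.replace(/&([a-z_0-9]+)=([a-z_0-9]+)/gi, token + '$1=$2');
}

function restoreQueryString(str) {
  return str.split(token).join('&');
}

const url = 'http://example.com?something=123&somethingElse=456&foo=bar';
const protectedUrl = protectQueryString(url);
// ...ampersand-mangling filter steps would run here...
console.log(restoreQueryString(protectedUrl) === url); // true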

What is the smallest URL friendly encoding?

I am looking for a URL encoding method that is most efficient in terms of space. Raw binary (base2) could be represented in base16, which is more compact and URL safe, but base64 is even more efficient. However, the usual base64 encoding isn't URL safe...
So what is the smallest encoding method that is also safe for URLs?
This is what the Base64 URL encoding variant is for.
It uses the same standard Base64 Alphabet except that + is changed to - and / is changed to _.
Most modern Base64 implementations will support this alternate encoding. If yours doesn't, it's usually just a matter of doing a search/replace on the Base64 input prior to decoding, or on the output prior to sending it to a browser.
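In Node.js, for example, the variant is available directly (the 'base64url' encoding requires Node 15.7+); a sketch:

const buf = Buffer.from('any binary data');

// URL-safe variant built in (it also omits the trailing '=' padding):
const urlSafe = buf.toString('base64url');

// Fallback for older runtimes: transform standard Base64 by hand.
const manual = buf.toString('base64')
  .replace(/\+/g, '-')
  .replace(/\//g, '_')
  .replace(/=+$/, '');   // '=' padding would also need encoding, so strip it

console.log(urlSafe === manual); // true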
You can use a 62 character representation instead of the usual base 64. This will give you URLs like the youtube ones:
http://www.youtube.com/watch?v=0JD55e5h5JM
You can use the PHP functions provided in this page if you need to map strings to a database numerical ID:
http://bsd-noobz.com/blog/how-to-create-url-shortening-service-using-simple-php
Or this one if you need to directly convert a numerical ID to a short URL string:
http://kevin.vanzonneveld.net/techblog/article/create_short_ids_with_php_like_youtube_or_tinyurl/
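If you'd rather not use PHP, the ID-to-string mapping those pages describe is short enough to sketch in JavaScript (function names are mine):

const ALPHABET =
  '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Convert a numeric database ID to a short base-62 string and back.
function encodeBase62(n) {
  let out = '';
  do {
    out = ALPHABET[n % 62] + out;
    n = Math.floor(n / 62);
  } while (n > 0);
  return out;
}

function decodeBase62(s) {
  return [...s].reduce((n, c) => n * 62 + ALPHABET.indexOf(c), 0);
}

console.log(encodeBase62(123456789)); // '8M0kX'
console.log(decodeBase62('8M0kX'));   // 123456789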
"base66" (theoretical, according to spec)
As far as I can tell, the optimal encoding for URLs is a "base66" encoding into the following alphabet:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789-_.~
These are all the "Unreserved Characters" according to the URI specification RFC 3986 (section 2.3), so they will appear as-is in the URL. Using this "base66" encoding could give a URL like:
https://example.org/articles/.3Ja~jkWe
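A sketch of the same positional encoding over this alphabet (using BigInt so large values don't lose precision):

// The 66 unreserved characters from RFC 3986 section 2.3.
const URL_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~';

function encodeBase66(n) {   // n is a BigInt
  let out = '';
  do {
    out = URL_ALPHABET[Number(n % 66n)] + out;
    n /= 66n;                // BigInt division truncates
  } while (n > 0n);
  return out;
}

console.log(encodeBase66(123456789012345678901234567890n));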
The question is then whether you want . and ~ in your URLs.
On some older servers (ancient by now, I guess) ~joe would mean the "www directory" of the user joe on that server, so a user might be confused about what the ~ character is doing in the middle of your URL.
This convention is still common on academic websites, especially those of CS professors (e.g. Donald Knuth's website https://www-cs-faculty.stanford.edu/~knuth/).
"base80" (in practice, but not battle-tested)
However, in my own testing the following 14 other symbols also do not get
percent-encoded (in Chrome 95 and Firefox 93):
!$'()*+,:;=#[]
(see also this StackOverflow answer)
leaving a "base80" URL encoding possible. Some of these (notably + and =) would not work in the query string part of the URL, only in the path part. All in all, this ends up giving you beautiful, hyper-compressed URLs like:
https://example.org/articles/1OWG,HmpkySCbBy#RG6_,
https://example.org/articles/21Cq-b6Ud)txMEW$,hc4K
https://example.org/articles/:3Tx**U9X'd;tl~rR]q+
There's a plethora of reasons why you might not want all of those symbols in your URLs. One example is that StackOverflow's own "linkifier" won't include that ending comma in the link it generates (I've manually made it a part of the link here).
Also, the percent-encoding seems to be quite finicky: in some cases Firefox would initially percent-encode ' and ~ but on later requests would not.

Problem using unicode in URLs with cgi.PATH_INFO in ColdFusion

My ColdFusion (MX7 on IIS 6) site has search functionality which appends the search term to the URL e.g. http://www.example.com/search.cfm/searchterm.
The problem I'm running into is this is a multilingual site, so the search term may be in another language e.g. القاهرة leading to a search URL such as http://www.example.com/search.cfm/القاهرة
The problem arises when I come to retrieve the search term from the URL. I'm using cgi.PATH_INFO to retrieve the path of the search page plus the search term, and extracting the search term from it, e.g. /search.cfm/searchterm. However, when unicode characters are used in the search, they are converted to question marks, e.g. /search.cfm/??????.
These appear to be actual question marks, rather than the browser failing to render unicode characters, or the characters being mangled on output.
I can't find any information about whether ColdFusion supports unicode in the URL, or how I can go about resolving this and getting hold of the complete URL in some way - does anyone have any ideas?
Cheers,
Tom
Edit: Further research has led me to believe the issue may be related to IIS rather than ColdFusion, but my original query still stands.
Further edit
The result of GetPageContext().GetRequest().GetRequestUrl().ToString() is http://www.example.com/search.cfm/searchterm/????? so it appears the issue goes fairly deep.
Yeah, it's not really ColdFusion's fault. It's a common problem.
It's mostly the fault of the original CGI specification, which specifies that PATH_INFO has to be %-decoded, thus losing the original %xx byte sequences that would have allowed you to work out which real characters were meant.
And it's partly IIS's fault, because it always tries to read submitted %xx bytes in the path part as UTF-8-encoded Unicode (unless the path isn't a valid UTF-8 byte sequence in which case it plumps for the Windows default code page, but gives you no way to find out this has happened). Having done so, it puts it in environment variables as a Unicode string (as envvars are Unicode under Windows).
However most byte-based tools using the C stdio (and I'm assuming this applies to ColdFusion, as it does under Perl, Python 2, PHP etc.) then try to read the environment variables as bytes, and the MS C runtime encodes the Unicode contents again using the Windows default code page. So any characters that don't fit in the default code page are lost for good. This would include your Arabic characters when running on a Western Windows install.
A clever script with direct access to the Win32 GetEnvironmentVariableW API could call it to retrieve a native-Unicode environment variable, which it could then encode to UTF-8 or whatever else it wanted, assuming the input was also UTF-8 (which is what you'd generally want today). However, I don't think ColdFusion gives you this access, and in any case it only works from IIS6 onwards; IIS5.x will throw away any non-default-codepage characters before they even reach the environment variables.
Otherwise, your best bet is URL-rewriting. If a layer above CF can convert that search.cfm/القاهرة to search.cfm/?q=القاهرة then you don't face the same problem, as the QUERY_STRING variable, unlike PATH_INFO, is not specified to be %-decoded, so the %xx bytes remain where a tool at CF's level can see them.
Here's what you could do:
<cfset url.searchTerm = URLEncodedFormat("القاهر", "utf-8") >
<cfset myVar = URLDecode(url.searchTerm , "utf-8") >
Of course, I'd recommend that you work with something like this in that case:
yourtemplate.cfm?searchTerm=%C3%98%C2%A7%C3%99%E2%80%9E
And then you do URL rewriting in IIS (if not already done by framework/rest of the app) http://learn.iis.net/page.aspx/461/creating-rewrite-rules-for-the-url-rewrite-module/ to match your pattern.
You can set the character encoding of the URL and FORM scope using the setEncoding() function:
http://www.adobe.com/livedocs/coldfusion/7/htmldocs/wwhelp/wwhimpl/common/html/wwhelp.htm?context=ColdFusion_Documentation&file=00000623.htm
You need to do this before you access any of the variables in this scope.
But, the default encoding of those scopes is already UTF-8, so this may not help. Also, this would probably not affect the CGI scope.
Is the IIS Server logging the correct characters into the request log?
