Is there any reason why okhttp3.Credentials uses ISO_8859_1 instead of UTF-8?

I am using okhttp3.Credentials to build a Base64 string in my current project. I ran into an issue with Cyrillic characters that I sent to the server as a Base64 string, and eventually found out that the current implementation of okhttp3.Credentials uses ISO_8859_1.
Is there a deliberate reason to use ISO_8859_1 here instead of the more universal UTF-8?
Update from the answer, quoting the spec (RFC 7617):
The original definition of this authentication scheme failed to
specify the character encoding scheme used to convert the user-pass
into an octet sequence. In practice, most implementations chose
either a locale-specific encoding such as ISO-8859-1 ([ISO-8859-1]),
or UTF-8 ([RFC3629]). For backwards compatibility reasons, this
specification continues to leave the default encoding undefined, as
long as it is compatible with US-ASCII (mapping any US-ASCII
character to a single octet matching the US-ASCII character code).
B.3. Why not simply switch the default encoding to UTF-8?
There are sites in use today that default to a local character
encoding scheme, such as ISO-8859-1 ([ISO-8859-1]), and expect user
agents to use that encoding. Authentication on these sites will stop
working if the user agent switches to a different encoding, such as
UTF-8.
Note that sites might even inspect the User-Agent header field
([RFC7231], Section 5.5.3) to decide which character encoding scheme
to expect from the client. Therefore, they might support UTF-8 for
some user agents, but default to something else for others. User
agents in the latter group will have to continue to do what they do
today until the majority of these servers have been upgraded to
always use UTF-8.

See the discussion at https://github.com/square/okhttp/pull/3134
The ISO-8859-1 default is legacy behavior, and you can override it with the optional charset parameter.
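To illustrate why the charset matters for Cyrillic credentials, here is a minimal Python sketch (it is not OkHttp's code). The helper name basic_header and the sample credentials are made up for illustration, and errors="replace" is an assumption meant to mimic an encoder that substitutes '?' for characters the charset cannot represent.

```python
import base64

def basic_header(user: str, password: str, charset: str) -> str:
    """Build a Basic Authorization header value using the given charset.

    Hypothetical helper for illustration only. Characters the charset
    cannot represent are replaced with '?' (an assumed behavior).
    """
    raw = f"{user}:{password}".encode(charset, errors="replace")
    return "Basic " + base64.b64encode(raw).decode("ascii")

latin1 = basic_header("user", "пароль", "iso-8859-1")
utf8 = basic_header("user", "пароль", "utf-8")

# Decode what a server would actually receive in each case.
print(base64.b64decode(latin1[6:]))                 # Cyrillic is lost
print(base64.b64decode(utf8[6:]).decode("utf-8"))   # Cyrillic survives
```

With ISO-8859-1 the Cyrillic password cannot be represented at all, so the server only accepts it if both sides agree on UTF-8.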

Related

What is this string encoding?

I'm doing some reverse engineering and came across this string in one field of a ProtoBuf-encoded HTTP request. The ProtoBuf binary structure is already decoded; this is the contents of one of its fields.
Does anybody recognize this encoding? It's not base 64 and doesn't appear to be escaped Unicode characters since there are regular non-escaped characters interspersed throughout.
\002\000\000\000P\030\326--\037\352Hx\232\244\322.\224\'\246\004P\3314\372g\274\366\362\337\277\226b\236\nr8\351 ]u\362\214\374\330O5\246Y+\276\005\212\234\017\216\333\312*\313g\357\267t\227\034\244_}\205jiO\261\271\304\013\224\373.zZ\224\230\260\004\2411\000\323\362\345K\300h\307\\\220\335\304\022\357\2230\355\375\032\210\330\2711\374\272\336\277kC]\334?\226\370w\262\023)4 D\273\344H\212\000\347}u\336lOp\237\3666\337j\002*s\033|\010\000\000%\3157/w\327\364\252)\235\245wQ\325+W\026*\215\357E\005\271w\002\246\216\325\002&e\217T!\242\376\307\321\267\016_\017Q\265p\007\035\367\324\216H\314\222\3244\004\353b\017\325\025N\017\205dk\257\237g\"\367\245\324*\204^\010\233\244\002\266\007\231\226w\006\2056\313\265_\236Y\270\nP\216\nq\373\330#\345,\271\241\177\331\271\023K\227\013\317d\335mg\255\266\232pp2d\253\332A:Gs!0>O\226\315_\264G\234\326\240\213\261\253\017\352\214\365\007{ \022\365r<\306\354\355]\320\010\2511\225\215M\276\366P\264\003\315F\314\301\244\350\034\316P\375\317v(\360\244\347\371<$$9\360\267\340H\372\362\271\307\357\215J\3433\215\331=\tqQ1\354\213\333R\331--\213Tc\352a\337\236\346[#-\266]\354\202\335\307\333\330\213\351b\254kt\304\210\276\300\013\322\306\242\000\037\177\354#[U|\302j\r{\243\247\257\000XY\344\020x*\310\363\242^\315\271\371\335\210\030\310\255S\240)\234C\'\200+\313\246oM\304\271^6s\345IG3.\306Xhf\235\244\004[\314Tb\373\360\023\3565\265\254\351\236\227b\337\264\207\302\nq\t\\\236\253e\271I\244u\004\204\220\336\367\333\337# \005\361IPe\270v\016\010RPd\306r\254\2651J\250P\320\312l\177g\036\021\276fC\0136\363\372\265\003v\243y\215w\266q\364a\025\210\264\251\227\333\235r\316\275\237r\256o\231\263\3358?\240\001\306: 
\211\223\375-\247=\361\207\022\321Gb\326\230x\342\203\014^\371\243\037N\000\376+\370\302\351\227F\025\025\017C\225x]\201c\370\373{\230\2656\222\334,\266\016Q\320\005-Y\203z\200\207\205\2667\264\320\027\250\3007,\303\204\006\357\036\254)\271\271!;\233\300P\250\220\306V/_EP\352\272-\242\276q\252C\337BV\022\357\2467y\025\377A\017u\\\335m\352\037\215\026f\354\3100\312\032\235i\333\312|h\255\266\376\234\345\\\361lC\n\022\341K\022cnU\217\'\222Hl\312\006;0\003V\006\255\256\016\262Z\220zo\002\004\316\370\317\371\220O^q\247\313g\301\376\354W\346\001F\262\233\354\024\004kzk\032\313\0132u\346R\013z:TQ\007\347\273\343\022&X\357\334\305\307;\221W\301\236\360Ap\311\t\024W\004i\221\301a\356}\036\362\002J\267R\335\371(\357\025<\322H\232\334a\375\215eSl\324\214P\367\377T\236\346\346\026\367h\214\275;\013\205\n\302%\\\017\227a\373\376\347\222\\\014cT\340\'\361\024t\t:\203c\314\361W\252\336+\376e\353\336\237\272\2745\315\354\356\272\037Z\246Z\277;\344j\271\022\273\274\025\367\037\257\372p\204\224\314\244\026&o\365\220\235`\365c\377\306\304)&f]q\241\252|d\270H\010?\300$\275\200^!\r\272_\237V\241=\245\020#\314\362\032\031\312t\037\0344\254\264\213Y\315:\215\271\222\277\332\007\220\t\357N]\361O\\\257\352<F\001{\214\317\226\314\'&\232\026\314\350\020\200\316\370\216\231\325\2574\373R\231\316\251\257\260z!\033\203\357\364\310\021\0029\000)\034\010\276Tr\336y0\376\232h~\332y-\354\327w\220\254\321\022\210\266\345\245\325gy\210\357\356\215P$\270\372\3169\365\022\357\225A\324\352\313\340\3445\247\267\352{\037\266\244\205\262\023\t\\\224\020\236\307C\241\371\214\345\216^\271\320\345?\0052\341TD\235j\370\306\236\274\254J\213 \377\212K\032\265\251\367,Q\331\0067ZE\235\253\256\311\022\320\232\205p\262\370\032h\255\304\304D\366\340\276\006\200\307S\230\340?\212jj\261\377r\337\223 
\305\217\310\344Xi()*\225z[Y\313_t}\331\240\000>\024:\3242\322\030\352ZWB\247`\320\340\243\204\224\312&\274\321qi\375\231\374\201\235\234{\344\367\002lO\350\363X\361\rh)\231\337r\361\306w\360B\271\013\233IoG\245~:X5%h\222\247J\\w\373\266\374\340\314\313\226\224\204:\250\363\243\265H|\003Y\263\023sZ7#\351)V]{\3065E\210t\207\353^\205q\211\003Yj\373\227Qb4v\2213TO\"S\301^\272\035\t\212|eJ\332t\243\177\274\016ni^ 8\273\317p(N\263j\375\254k\253h\206%ta*LM\270v\2473\220\263\366\211\302=Q\217~\0029\246\236\374\350\247%\221\001`B\337\321N\216wR\235\336\244.K;Y\330\033\372i<\3156{z\310\255\031\021wr{{\331F01\227\010\346B#\341\276\'\246\372S\250\356\222\370,\334h\217\025\334S\016\005\007/,\024\355\024V\246\007\036;\030\337\002c\254\304[\253\tN\331X|$[%*\242\353\254\227>\031\304\203\275\277``c\240\344\277\213\377\204\223\202\026#\367\271\302k\027\262\020H\024p\010\203\264iM\233F7\333\354\352\303\223\217Hi\'\375\010\302\035\013\273F(\032\272\377\252:8\213\304\036\264y\t\265\025\300\317\324za4\010I$Eu\310,\006y,^\3531\027g\343o\314j\270\3152gif\271(\037g\031\375\325\341\320\320\317HJ+\374>%\320\234V\317\332\232x\034x\233R9\245\346_r\307{\030y\234z\331zV\031\264\035\324\003\260AQ\024\217\230\213w\021\3205g\273\275nn\357\275\217?Kd\031\353yF\'\234\201\335+\177\350\001\340D\324\"\340\335\254\304\360=\301\'$\274\235e\032$N\345+\244WKC\204\342\024\307\3103\2722\024\216\002\221UbTn\233\244\261\347\303\340A\312l\317\263Gm\352\000v\245X\334\"\263\315z\374N\244\365\013\375\260\220\251\203\036gD\364p3i#n\016\031[,\336\300\0000\352\001NK()\214\023\222w\014B\242\220\206\034\333\256\265\331-\220\361F\203s\014S\236p\265\236\343g\020HR\235\325W\360\030(\374\341\000\261\315s\315vv\017]s\311o\033c\206\303\245\347\372C\345\207\244\207AL+\306c\026\001\307\3409\331\205\340\371\365\006\263\352kF\010\035K\354\225\035\341\014\360*\232\035\251\t\344\205\374\235\374\352\n}\262+\252\321\377\010G\215\263GA\230\364Z\037\323\351\220\226\272\002\207\254\241\263X\t_ 
N\307\326\350\246hI\223\223J&\373-\344\243\316\300m.FHmNdS?\tCf\001\252\307\346\205H\026\375)#\006\261g\036\307\252\205\000\027}\212_\021 )4\207#\213n\254H\205\036\325q\217\025\305\036\010J\017\320\257\203\226\025X\313\032,\003\341\003\023cw\375r\337\223\233>+\335\223\206\203}\035!\3100\242Tv\350\255\276\343&\220\213\361\354Ij\035\312_\273\233\333\327;\022\016\315a5\373\217t\324ZJ\202\304(~,B(\215\005E\341\375\036\260C\213\364\240\020\373\340\275\310\2048*\326\"^$\366\367\252#\201\355\000\273\010#`J\230\363\320\363L9\261\216\353((#;3\366oKR\021\nL\244a\244\376\032\304\376\001|\317c\222%c=\\\225\340I\225\301\277G\227\242\366\025\323y0\273\241\217E[\032=\253e\001\270q\005\241\374\276\267$\277Lj\3528\257z\247\242+}\304\254(\013\336g\230\237\270\212I#\245\247\271)\026i\346\366\342\021\005\373i\341`A\020|\367\337\312$`\241\322\007YaQ#\216cy&\371\206\223\264+g\0213b\315\217\371\364\013x\327\2478\0013\352\372\375E\233\352\200\213<\021puH\347x;\354\036\024\\\253_\340\200xH\353\350b\364\207\276*\323J\341\200\r\276]e\217\307\305\275\350\004V\300\272\271\010\345KM\330\2716$\030\225\223\322\347\325\260\331Ok\0340Y\241\276\353\223\276\253>\256\022\257CE\320\007D\236\201\026\214\177\036\277\347\031\001\254\240L\203\n\332\252c\211Y\031\310\212\r+J\274E0
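One reading, offered as an assumption rather than a confirmed answer: the backslash sequences look like C-style escapes (three-digit octal such as \326, plus \n, \t, \' and \\), the way protobuf text output prints a bytes field. Under that assumption, the raw bytes could be recovered with a sketch like the following. The helper name unescape_bytes is made up, and only the first few characters of the string are used as a sample.

```python
def unescape_bytes(text: str) -> bytes:
    """Turn C-style escaped text back into raw bytes (illustrative helper).

    unicode_escape maps \\NNN (octal), \\n, \\t, \\', \\" and \\\\ to code
    points 0-255; latin-1 then maps each code point to the same byte value.
    """
    return text.encode("ascii").decode("unicode_escape").encode("latin-1")

# First few characters of the string from the question.
sample = r"\002\000\000\000P\030\326--\037\352Hx"
raw = unescape_bytes(sample)
print(raw.hex(" "))
```

What the resulting bytes mean (compressed, encrypted, or a nested message) is a separate question that this step does not settle.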

Node Buffer Alias - binary is latin1?

According to this page:
'binary' - Alias for 'latin1'.
However, binary is not representable in latin1 as there are certain codepoints which are missing. So, a developer like me who wants to use NodeJS buffers for binary data (an extremely common use-case) would expect to use 'binary' as the encoding. There doesn't appear to be any documentation that properly explains how to handle binary data! I'm trying to make sense of this.
So my question is: Why was latin1 chosen as an alias for binary?
People have mentioned that using null as the encoding will work for binary data. So a followup question: Why don't null and 'binary' do the same thing?
The Node documentation's definition of 'latin1', on the line above the definition of 'binary' cited in the question, is not ISO 8859-1. It is:
'latin1' - A way of encoding the Buffer into a one-byte encoded string
(as defined by the IANA in RFC1345, page 63, to be the Latin-1
supplement block and C0/C1 control codes).
The 'latin1' charset specified in RFC 1345 defines mappings for all 256 codepoints. It does not have the holes that exist at 0x00-0x1f and 0x7f-0x9f in the ISO 8859-1 mapping.
Why don't null and 'binary' do the same thing?
Node does not have a null encoding. If you call Buffer.from('foo', null) then you get the same result as if you had called Buffer.from('foo'). That is, the default encoding is applied. The default encoding is 'utf8', and clearly that can produce results that differ from the 'binary' encoding.
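The same two points can be checked outside Node. The sketch below uses Python's latin-1 codec as a stand-in (an assumption: it is an analogue of the behavior described, not Node's Buffer API) to show that a full 256-entry mapping round-trips every byte value, and that UTF-8 produces different bytes for the same text.

```python
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so every byte value round-trips.
text = all_bytes.decode("latin-1")
round_trip = text.encode("latin-1")

# UTF-8 encodes code points above 0x7F as multiple bytes, so the result
# differs from the one-byte-per-character encoding.
s = "\u00e9"  # é
as_latin1 = s.encode("latin-1")
as_utf8 = s.encode("utf-8")

print(round_trip == all_bytes)
print(as_latin1, as_utf8)
```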

String data flow with encoding and decoding from browser to mysql database and vice versa

A user enters a string in an HTML form input in the browser. This string is saved in the database. How is this string encoded and decoded at each stage, based on the character encoding?
Flow as per technology stack used: browser --> ajax post --> spring mvc -->hibernate -->mysql db
You can expect the browser to post URL-encoded UTF-8. Inside the JVM, strings are held as UTF-16, which roughly doubles the size of English text. Hibernate runs inside the JVM and does not really care about the encoding, although it does pass along the connection string described next (the hibernate.connection.url property).
The JDBC driver then translates the UTF-16 string; in the case of MySQL, it uses the characterEncoding property in the connection string. It helps if this matches the encoding declared for the database in the CREATE DATABASE statement, which avoids another re-encoding.
Finally, "latin" is not a name of a specific character set or encoding. You probably mean ISO 8859-1, also known as Latin-1. This is not a good choice for a web server as it will not be able to represent most non-English strings. You should use UTF-8 in the database and in the connection string, ending up with UTF-8 -> UTF-16 -> UTF-8 which is a safe and reasonably efficient sequence (not counting any encoding that might be happening in the browser itself).
If you decide to alter the database to use UTF-8, be careful about changing the encoding at table level, too. Each table may use its own encoding and it does not change automatically.
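The UTF-8 -> UTF-16 -> UTF-8 chain described above can be sketched in Python. This only models the byte-level conversions; the sample text is made up, and "utf-16-be" without a byte order mark is an assumption standing in for the JVM's in-memory representation.

```python
from urllib.parse import quote, unquote

text = "héllo"

posted = quote(text)                    # browser: URL-encoded UTF-8
received = unquote(posted)              # server decodes back to text
in_memory = received.encode("utf-16-be")  # UTF-16, two bytes per character here
to_db = received.encode("utf-8")        # what a UTF-8 connection would send

print(posted)
print(len(in_memory), len(to_db))
```

For mostly-English text the UTF-16 form is about twice the size of the UTF-8 form, which matches the "roughly doubling" remark above.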

What is the smallest URL friendly encoding?

I am looking for the URL encoding method that is most efficient in terms of space. Raw binary (base2) could be represented in base16, which is smaller and URL-safe, but base64 is even more efficient. However, the usual base64 encoding isn't URL-safe...
So what is the smallest encoding method that is also safe for URLs?
This is what the Base64 URL encoding variant is for.
It uses the same standard Base64 Alphabet except that + is changed to - and / is changed to _.
Most modern Base64 implementations will support this alternate encoding. If yours doesn't, it's usually just a matter of doing a search/replace on the Base64 input prior to decoding, or on the output prior to sending it to a browser.
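As a quick illustration, Python's standard library exposes both alphabets, and the search/replace fallback gives the same result. The two input bytes are chosen only because they produce + and / in the standard alphabet.

```python
import base64

data = b"\xfb\xff"

standard = base64.b64encode(data)          # uses + and /
url_safe = base64.urlsafe_b64encode(data)  # uses - and _

# Fallback for libraries without a URL-safe variant: plain search/replace.
manual = standard.replace(b"+", b"-").replace(b"/", b"_")

print(standard, url_safe, manual)
```

Note that the = padding is still present in both forms; whether to keep or strip it depends on the receiving side.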
You can use a 62 character representation instead of the usual base 64. This will give you URLs like the youtube ones:
http://www.youtube.com/watch?v=0JD55e5h5JM
You can use the PHP functions provided in this page if you need to map strings to a database numerical ID:
http://bsd-noobz.com/blog/how-to-create-url-shortening-service-using-simple-php
Or this one if you need to directly convert a numerical ID to a short URL string:
http://kevin.vanzonneveld.net/techblog/article/create_short_ids_with_php_like_youtube_or_tinyurl/
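For readers not using PHP, a base62 conversion of a numeric ID might look like the Python sketch below. It is not the code from the linked pages; the function names and the alphabet ordering (digits, then uppercase, then lowercase) are assumptions, and any fixed ordering works as long as encoder and decoder agree.

```python
import string

# Assumed ordering: 0-9, A-Z, a-z (62 symbols).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode_id(n: int) -> str:
    """Convert a non-negative integer ID to a base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode_id(s: str) -> int:
    """Convert a base62 string back to the integer ID."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode_id(123456789), decode_id(encode_id(123456789)))
```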
"base66" (theoretical, according to spec)
As far as I can tell, the optimal encoding for URLs is a "base66" encoding into the following alphabet:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789-_.~
These are all the "Unreserved characters" according to the URI specification RFC 3986 (section 2.3), so they will appear as-is in the URL. Using this "base66" encoding could give a URL like:
https://example.org/articles/.3Ja~jkWe
The question is then if you want . and ~ in your URLs?
On some older servers (ancient by now, I guess) ~joe would mean the "www directory" of the user joe on this server. And thus a user might be confused as to what the ~ character is doing in the middle of your URL.
This is common for academic websites, especially CS professors (e.g. Donald Knuth's website https://www-cs-faculty.stanford.edu/~knuth/)
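A small Python check of the "base66" idea, as a sketch: it confirms that a standard percent-encoder leaves all 66 characters untouched, and compares how many bits each character carries. The constant name BASE66 is made up for illustration.

```python
import math
import string
from urllib.parse import quote

BASE66 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_.~"

# With safe="" nothing is exempted beyond the unreserved characters,
# so an unchanged result means none of the 66 needs percent-encoding.
unchanged = quote(BASE66, safe="") == BASE66
print(len(BASE66), unchanged)

# Bits of payload carried per URL character for each alphabet size.
for size in (16, 64, 66):
    print(size, round(math.log2(size), 3))
```

The gain over base64 is small (a little over 6 bits per character instead of exactly 6), which is worth weighing against the readability concerns about . and ~ mentioned above.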
"base80" (in practice, but not battle-tested)
However, in my own testing the following 14 other symbols also do not get
percent-encoded (in Chrome 95 and Firefox 93):
!$'()*+,:;=#[]
(see also this StackOverflow answer)
leaving a "base80" URL encoding possible. Some of these (notably + and =) would not work in the query string part of the URL, only in the path part. All in all, this ends up giving you beautiful, hyper-compressed URLs like:
https://example.org/articles/1OWG,HmpkySCbBy#RG6_,
https://example.org/articles/21Cq-b6Ud)txMEW$,hc4K
https://example.org/articles/:3Tx**U9X'd;tl~rR]q+
There's a plethora of reasons why you might not want all of those symbols in your URLs. One example is that StackOverflow's own "linkifier" won't include that ending comma in the link it generates (I've manually made it a part of the link here).
Also the percent encoding seems to be quite finicky. In some cases Firefox would initially percent-encode ' and ~] but on later requests would not.

Problem using unicode in URLs with cgi.PATH_INFO in ColdFusion

My ColdFusion (MX7 on IIS 6) site has search functionality which appends the search term to the URL e.g. http://www.example.com/search.cfm/searchterm.
The problem I'm running into is this is a multilingual site, so the search term may be in another language e.g. القاهرة leading to a search URL such as http://www.example.com/search.cfm/القاهرة
The problem is when I come to retrieve the search term from the URL. I'm using cgi.PATH_INFO to retrieve the path of the search page and the search term and extracting the search term from this e.g. /search.cfm/searchterm however, when unicode characters are used in the search they are converted to question marks e.g. /search.cfm/??????.
These appear to be actual question marks, rather than the browser not being able to format unicode characters, or them being mangled on output.
I can't find any information about whether ColdFusion supports unicode in the URL, or how I can go about resolving this and getting hold of the complete URL in some way - does anyone have any ideas?
Cheers,
Tom
Edit: Further research has led me to believe the issue may be related to IIS rather than ColdFusion, but my original query still stands.
Further edit
The result of GetPageContext().GetRequest().GetRequestUrl().ToString() is http://www.example.com/search.cfm/searchterm/????? so it appears the issue goes fairly deep.
Yeah, it's not really ColdFusion's fault. It's a common problem.
It's mostly the fault of the original CGI specification, which specifies that PATH_INFO has to be %-decoded, thus losing the original %xx byte sequences that would have allowed you to work out which real characters were meant.
And it's partly IIS's fault, because it always tries to read submitted %xx bytes in the path part as UTF-8-encoded Unicode (unless the path isn't a valid UTF-8 byte sequence in which case it plumps for the Windows default code page, but gives you no way to find out this has happened). Having done so, it puts it in environment variables as a Unicode string (as envvars are Unicode under Windows).
However most byte-based tools using the C stdio (and I'm assuming this applies to ColdFusion, as it does under Perl, Python 2, PHP etc.) then try to read the environment variables as bytes, and the MS C runtime encodes the Unicode contents again using the Windows default code page. So any characters that don't fit in the default code page are lost for good. This would include your Arabic characters when running on a Western Windows install.
A clever script that has direct access to the Win32 GetEnvironmentVariableW API could call that to retrieve a native-Unicode environment variable, which it could then encode to UTF-8 or whatever else it wanted, assuming that the input was also UTF-8 (which is what you'd generally want today). However, I don't think ColdFusion gives you this access, and in any case it only works from IIS6 onwards; IIS5.x will throw away any non-default-codepage characters before they even reach the environment variables.
Otherwise, your best bet is URL-rewriting. If a layer above CF can convert that search.cfm/القاهرة to search.cfm/?q=القاهرة then you don't face the same problem, as the QUERY_STRING variable, unlike PATH_INFO, is not specified to be %-decoded, so the %xx bytes remain where a tool at CF's level can see them.
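The difference between the two routes can be modelled in Python. This is a simulation under stated assumptions, not what IIS or ColdFusion literally execute: cp1252 stands in for a Western Windows default code page, and errors="replace" stands in for the lossy re-encoding step described above.

```python
from urllib.parse import quote, unquote

term = "\u0627\u0644\u0642\u0627\u0647\u0631\u0629"  # القاهرة
encoded = quote(term)  # the %xx form sent by the browser (UTF-8)

# PATH_INFO route: the server percent-decodes to Unicode, then a byte-based
# tool reads it through the default code page, which has no Arabic letters.
path_info = unquote(encoded).encode("cp1252", errors="replace")

# QUERY_STRING route: the %xx bytes are passed through untouched, so the
# application can decode them itself as UTF-8.
query_string = "q=" + encoded
recovered = unquote(query_string[2:])

print(path_info)
print(recovered == term)
```

The first result is a run of question marks, one per Arabic character, which matches the symptom in the question; the second route recovers the original term.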
Here's what you could do:
<cfset url.searchTerm = URLEncodedFormat("القاهرة", "utf-8") >
<cfset myVar = URLDecode(url.searchTerm , "utf-8") >
Of course, I'd recommend that you work with something like this in that case:
yourtemplate.cfm?searchTerm=%D8%A7%D9%84%D9%82%D8%A7%D9%87%D8%B1%D8%A9
And then you do URL rewriting in IIS (if not already done by framework/rest of the app) http://learn.iis.net/page.aspx/461/creating-rewrite-rules-for-the-url-rewrite-module/ to match your pattern.
You can set the character encoding of the URL and FORM scope using the setEncoding() function:
http://www.adobe.com/livedocs/coldfusion/7/htmldocs/wwhelp/wwhimpl/common/html/wwhelp.htm?context=ColdFusion_Documentation&file=00000623.htm
You need to do this before you access any of the variables in this scope.
But, the default encoding of those scopes is already UTF-8, so this may not help. Also, this would probably not affect the CGI scope.
Is the IIS Server logging the correct characters into the request log?

Resources