String data flow with encoding and decoding from browser to MySQL database and vice versa

A user enters a string into an HTML form input in the browser, and that string is saved in a database. How is the string encoded and decoded at each stage, based on character encoding?
Flow as per the technology stack used: browser --> AJAX POST --> Spring MVC --> Hibernate --> MySQL DB

You can expect the browser POST to be URL-encoded UTF-8. Within the Java JVM, strings use UTF-16, roughly doubling the size for English text. Hibernate sits in that layer and does not really care about the encoding, although it does pass the connection string along, as described next (the hibernate.connection.url property).
The UTF-16 string is then translated by the JDBC driver, which, in the case of MySQL, uses the characterEncoding property from the connection string. It helps if this matches the encoding declared for the database in the CREATE DATABASE statement, avoiding yet another re-encoding.
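As a minimal sketch, the JDBC side could look like the following; the host, database name, and credentials are placeholders, while useUnicode and characterEncoding are standard MySQL Connector/J properties (the same URL would go into hibernate.connection.url):

    import java.sql.Connection;
    import java.sql.DriverManager;

    // Placeholder host/database/credentials; the key part is characterEncoding,
    // which tells Connector/J how to encode UTF-16 Java strings on the wire.
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8",
        "user", "secret");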
Finally, "latin" is not a name of a specific character set or encoding. You probably mean ISO 8859-1, also known as Latin-1. This is not a good choice for a web server as it will not be able to represent most non-English strings. You should use UTF-8 in the database and in the connection string, ending up with UTF-8 -> UTF-16 -> UTF-8 which is a safe and reasonably efficient sequence (not counting any encoding that might be happening in the browser itself).
If you decide to alter the database to use UTF-8, be careful to change the encoding at the table level, too. Each table may use its own encoding, and it does not change automatically.
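For the conversion itself, a hedged example continuing from the sketch above (the database and table names are hypothetical); note that utf8mb4 is MySQL's complete UTF-8 implementation, while the legacy utf8 charset stores at most three bytes per character:

    import java.sql.Statement;

    try (Statement st = conn.createStatement()) {
        // Change the default for tables created later in this schema...
        st.execute("ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci");
        // ...and convert each existing table's data and columns in place.
        st.execute("ALTER TABLE articles CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci");
    }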

Related

Is there any reason why okhttp3.Credentials use ISO_8859_1 instead UTF-8?

I am using okhttp3.Credentials to produce a Base64 string in my current project. I spotted an issue with Cyrillic symbols I passed to the server as a Base64 string, and eventually found out that the current implementation of okhttp3.Credentials uses ISO_8859_1.
Is there a deliberate reason to go with ISO_8859_1 here instead of the more universal UTF-8?
Update from the answer:
From the referenced spec:
The original definition of this authentication scheme failed to specify the character encoding scheme used to convert the user-pass into an octet sequence. In practice, most implementations chose either a locale-specific encoding such as ISO-8859-1 ([ISO-8859-1]), or UTF-8 ([RFC3629]). For backwards compatibility reasons, this specification continues to leave the default encoding undefined, as long as it is compatible with US-ASCII (mapping any US-ASCII character to a single octet matching the US-ASCII character code).

B.3. Why not simply switch the default encoding to UTF-8?

There are sites in use today that default to a local character encoding scheme, such as ISO-8859-1 ([ISO-8859-1]), and expect user agents to use that encoding. Authentication on these sites will stop working if the user agent switches to a different encoding, such as UTF-8.

Note that sites might even inspect the User-Agent header field ([RFC7231], Section 5.5.3) to decide which character encoding scheme to expect from the client. Therefore, they might support UTF-8 for some user agents, but default to something else for others. User agents in the latter group will have to continue to do what they do today until the majority of these servers have been upgraded to always use UTF-8.
Discussion here https://github.com/square/okhttp/pull/3134
It's legacy behavior, and you can override it with the optional charset parameter.
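For instance, a minimal sketch using the Charset overload of Credentials.basic that the linked PR introduced (the username and password here are placeholders):

    import java.nio.charset.StandardCharsets;
    import okhttp3.Credentials;

    // Encode the user-pass pair as UTF-8 instead of the ISO-8859-1 default,
    // so Cyrillic (or any non-Latin-1) credentials survive the round trip.
    String authHeader = Credentials.basic("иван", "пароль", StandardCharsets.UTF_8);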

Informix - Decode a Base-64 encoded string

I have an attribute in my table that gets converted to a Base-64 encoded string by some .NET code. Now I want to decode it, but I want to do it in the SELECT query itself.
I am not finding any info (as usual with Informix documentation), but it seems such a function does not exist. Am I correct?
select foranea_ci_persona, ip, BASE64DECODE(query) as Consulta, fecha_hora
from Historial;
The query above does not work at all.

Why do we need metadata specifying the encoding?

I sense a bit of a chicken-and-egg problem here: if I write an HTML meta tag specifying the charset as, say, UTF-16, how do we decode the HTTP response in the first place if we didn't know it carries UTF-16 data? I believe the response headers need to handle this, so that by the time we read metadata like charset="utf-16" in an HTML tag, we already know it is UTF-16.
Besides, think one level higher, about the header information itself: are HTTP headers passed in ASCII as a standard?
At some level we need an agreed-upon convention; you can't place the data needed for decoding inside the data itself as metadata. Can anyone clarify this?
I am confused by the idea of specifying the information needed to interpret the data as metadata inside that same data.
In general, how can any form of encoding work if we don't have a standard, agreed-upon language/encoding to convey the data about the data itself?
For example, I am informed that Apache defaults to ISO 8859-1. Would all clients then need to assume that for HTTP headers, and interpret the actual content as UTF-8 if the Content-Type says UTF-8?
("What character encoding should I use for a HTTP header?" is a closely related question.)
UTF-16 (and some other) encodings use a BOM (Byte Order Mark) that is read at the start of the file and signals which encoding is being used. Only after that does the encoded part of the file begin.
For example, for UTF-16 you'll see the bytes FE FF if big-endian words are being used, and FF FE if little-endian.
You also often see UTF-8 BOMs, although they are not needed (and may confuse some XML parsers).
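To make the mechanism concrete, here is a minimal BOM-sniffing sketch (the class name and the no-BOM fallback to UTF-8 are my own choices):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class BomSniffer {
        // Inspect the first bytes of a document and guess its encoding
        // from the Byte Order Mark, if one is present.
        static Charset detect(byte[] head) {
            if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
                return StandardCharsets.UTF_8;    // EF BB BF: UTF-8 BOM
            }
            if (head.length >= 2 && (head[0] & 0xFF) == 0xFE
                    && (head[1] & 0xFF) == 0xFF) {
                return StandardCharsets.UTF_16BE; // FE FF: UTF-16 big-endian
            }
            if (head.length >= 2 && (head[0] & 0xFF) == 0xFF
                    && (head[1] & 0xFF) == 0xFE) {
                return StandardCharsets.UTF_16LE; // FF FE: UTF-16 little-endian
            }
            return StandardCharsets.UTF_8;        // assumption: no BOM, default to UTF-8
        }
    }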

What is the smallest URL friendly encoding?

I am looking for a URL encoding method that is most efficient in terms of space. Raw binary (base2) could be represented as base16, which is more compact and URL-safe, but base64 is more efficient still. However, the usual base64 alphabet isn't URL-safe...
So what is the smallest encoding method that is also safe for URLs?
This is what the Base64 URL encoding variant is for.
It uses the same standard Base64 alphabet, except that + is changed to - and / is changed to _.
Most modern Base64 implementations will support this alternate encoding. If yours doesn't, it's usually just a matter of doing a search/replace on the Base64 input prior to decoding, or on the output prior to sending it to a browser.
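In Java, for instance, this variant is built into java.util.Base64 (the byte array here is just sample data):

    import java.util.Base64;

    byte[] data = {(byte) 0xFB, (byte) 0xEF, 0x7F};

    // URL-safe alphabet: '-' and '_' instead of '+' and '/'.
    // withoutPadding() also drops the trailing '=' signs, which would
    // otherwise need percent-encoding in a query string.
    String token = Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    byte[] back  = Base64.getUrlDecoder().decode(token);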
You can use a 62-character representation instead of the usual base 64. This will give you URLs like the YouTube ones:
http://www.youtube.com/watch?v=0JD55e5h5JM
You can use the PHP functions provided on this page if you need to map strings to a numerical database ID:
http://bsd-noobz.com/blog/how-to-create-url-shortening-service-using-simple-php
Or this one if you need to directly convert a numerical ID to a short URL string:
http://kevin.vanzonneveld.net/techblog/article/create_short_ids_with_php_like_youtube_or_tinyurl/
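If you'd rather not follow the links, here is a hedged Java equivalent of the ID-to-token idea (the alphabet order is arbitrary, and the function names are my own):

    // Base62: digits plus upper- and lowercase letters, all URL-safe.
    static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Turn a numeric database ID into a short URL token.
    static String encodeId(long id) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        } while (id > 0);
        return sb.reverse().toString();
    }

    // And back again.
    static long decodeId(String token) {
        long id = 0;
        for (char c : token.toCharArray()) {
            id = id * 62 + ALPHABET.indexOf(c);
        }
        return id;
    }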
"base66" (theoretical, according to spec)
As far as I can tell, the optimal encoding for URLs is a "base66" encoding into the following alphabet:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789-_.~
These are all the "unreserved characters" according to the URI specification, RFC 3986 (section 2.3), so they will appear as-is in the URL. Using this "base66" encoding could give a URL like:
https://example.org/articles/.3Ja~jkWe
The question, then, is whether you want . and ~ in your URLs.
On some older servers (ancient by now, I guess), ~joe would mean the "www directory" of the user joe on that server, so a user might be confused about what the ~ character is doing in the middle of your URL.
This is common for academic websites, especially CS professors (e.g. Donald Knuth's website https://www-cs-faculty.stanford.edu/~knuth/)
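A quick sketch of how such an arbitrary-radix encoding could be implemented; this BigInteger approach ignores leading zero bytes, which a production version would need to handle:

    import java.math.BigInteger;

    // All 66 unreserved characters from RFC 3986, section 2.3.
    static final String BASE66 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~";

    static String toBase66(byte[] data) {
        // Interpret the bytes as one big non-negative integer...
        BigInteger n = new BigInteger(1, data);
        StringBuilder sb = new StringBuilder();
        // ...and repeatedly peel off the remainder modulo 66.
        while (n.signum() > 0) {
            BigInteger[] qr = n.divideAndRemainder(BigInteger.valueOf(66));
            sb.append(BASE66.charAt(qr[1].intValue()));
            n = qr[0];
        }
        return sb.length() == 0 ? String.valueOf(BASE66.charAt(0))
                                : sb.reverse().toString();
    }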
"base80" (in practice, but not battle-tested)
However, in my own testing the following 14 other symbols also do not get percent-encoded (in Chrome 95 and Firefox 93):
!$'()*+,:;=#[]
(see also this StackOverflow answer)
leaving a "base80" URL encoding possible. Some of these (notably + and =) would not work in the query string part of the URL, only in the path part. All in all, this ends up giving you beautiful, hyper-compressed URLs like:
https://example.org/articles/1OWG,HmpkySCbBy#RG6_,
https://example.org/articles/21Cq-b6Ud)txMEW$,hc4K
https://example.org/articles/:3Tx**U9X'd;tl~rR]q+
There is a plethora of reasons why you might not want all of those symbols in your URLs. One example is that Stack Overflow's own "linkifier" won't include the trailing comma in the link it generates (I've manually made it part of the link here).
Also, the percent-encoding behavior seems quite finicky: in some cases Firefox would initially percent-encode ' and ~] but would not on later requests.

Double base64 encoding danger & base64 security

I'm adding some capabilities to an API to allow third parties to store user data. I know some users may already base64-encode their user IDs before submitting them through the API; others might not.
I've done some checking on double encoding (taking the base64 of an already base64-encoded string), and it doesn't SEEM to be causing any problems.
From my understanding, it isn't possible to reliably check whether a string is base64-encoded.
Is there something here I should be looking out for down the line? is there another way I should be doing this, or is it safe?
Also, I'm cleaning the data like this:
$eid = preg_replace("/[^a-zA-Z0-9\/=+]/", "", base64_encode(@$_GET['eid']));
That should be safe to store in the database, as it strips out any suspect characters after the string is encoded. But I'll need to return the non-encoded string through the API.
So at some point I'll need to do echo base64_decode($eid); and it seems to me that this could be an opportunity for a hacker to run malicious code through my server.
Is that right?
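A small Java round-trip sketch of the double-encoding behavior described above (the sample ID is made up): as long as you decode exactly as many times as you encoded, the original value comes back unchanged.

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    String id    = "user-42";   // hypothetical user ID
    String once  = Base64.getEncoder()
                         .encodeToString(id.getBytes(StandardCharsets.UTF_8));
    String twice = Base64.getEncoder()
                         .encodeToString(once.getBytes(StandardCharsets.UTF_8));

    // Undo both layers, innermost last.
    String inner = new String(Base64.getDecoder().decode(twice), StandardCharsets.UTF_8);
    String back  = new String(Base64.getDecoder().decode(inner), StandardCharsets.UTF_8);
    // back.equals(id) is true; problems arise only when the number of
    // encode and decode steps does not match.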
