Why do we need metadata specifying the encoding? - web

I see a chicken-and-egg problem when I write an HTML meta tag specifying the charset as, say, UTF-16: how do we decode the entire HTTP request in the first place if we don't already know it is UTF-16 data? I believe the request header needs to handle this, so that by the time we read metadata such as the HTML tag's charset="utf-16" we already know the data is UTF-16.
Besides, think one level higher, about header information such as request headers: are they passed in ASCII as a standard?
I mean, at some level we need to agree on something, and you can't put the data needed for decoding into metadata that itself has to be decoded. Can anyone clarify this?
I am a bit confused by the idea of specifying the data needed to interpret the whole document as metadata inside that same document.
In general, how can any form of encoding work if we don't have a standard, agreed-upon language/encoding to convey the data about the data itself?
For example, I am told that Apache uses 8859-1 as its default. So would all clients need to use that for HTTP headers, and then interpret the real content as UTF-8 if the Content-Type asks for UTF-8?
"What character encoding should I use for a HTTP header?" is a closely related question.

UTF-16 (and some other encodings) can use a BOM (Byte Order Mark): a few bytes at the very start of the file that signal which encoding is being used. The encoded part of the file begins only after the BOM.
For example, for UTF-16, you'll have the bytes FE FF if big-endian and FF FE if little-endian words are being used.
You also often see UTF-8 BOMs (the bytes EF BB BF), although they don't need to be used (and may confuse some XML parsers).
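To make the BOM idea concrete, here is a minimal Python sketch of how a reader could sniff the first bytes before decoding anything else. The function names sniff_bom and decode_with_bom are made up for this example, and the UTF-8 fallback when no BOM is found is an assumption, not something the answer above prescribes.

```python
import codecs

# Known BOMs, longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode data using its BOM; the fallback encoding is an assumption."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

Calling decode_with_bom on bytes that start with FF FE would decode the rest as little-endian UTF-16, without the reader having to know the encoding in advance.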

Related

Want to know the decoding method

I wanted to decode my PUBG name. I came across this site: http://ddecode.com/hexdecoder/
It decodes the name the way I want, but now I want to know what technique it uses, so I can use it in my project.
Input :
PSYCH%C3%98%E4%B9%82JOKER
Decoded String:
PSYCHØ乂JOKER
Here is the result URL: http://ddecode.com/hexdecoder/?results=48d3b517a922349a1838240623f6e7c3
You should take a look at percent-encoding, which is a way to encode data so that it can be written validly in URLs. Each % symbol is followed by two hexadecimal digits giving one byte, and here those bytes are the UTF-8 encoding of the special characters Ø乂.
0xC3 0x98 corresponds to Ø and 0xE4 0xB9 0x82 to 乂 in UTF-8.
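As a quick check of the above, here is a small Python sketch using the standard library's urllib.parse. It only illustrates percent-decoding in general; it is not a claim about how the ddecode.com site is implemented.

```python
from urllib.parse import quote, unquote

encoded = "PSYCH%C3%98%E4%B9%82JOKER"

# Each %XX is one byte; the resulting bytes are interpreted as UTF-8.
decoded = unquote(encoded, encoding="utf-8")
print(decoded)  # PSYCHØ乂JOKER

# The reverse direction: non-ASCII characters become percent-encoded UTF-8 bytes.
reencoded = quote(decoded)
print(reencoded)  # PSYCH%C3%98%E4%B9%82JOKER
```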
By the way, since you added the encryption badge and used the word in your question: in this situation, we cannot speak of decryption. You might want to take a look at the difference between these terms (encoding and encryption, for example).

Base64-decode XML values using a Groovy script

I will be receiving the following XML data in a variable.
<order>
<name>xyz</name>
<city>abc</city>
<string>aGVsbG8gd29ybGQgMQ==</string>
<string>aGVsbG8gd29ybGQgMg==</string>
<string>aGVsbG8gd29ybGQgMw==</string>
</order>
Output:
<order>
<name>xyz</name>
<city>abc</city>
<string>hello world 1</string>
<string>hello world 2</string>
<string>hello world 3</string>
</order>
I know how to decode from Base64, but the problem is that some of the values are already decoded and some are still encoded. What is the best approach to decode this data using Groovy so that I get the output shown above?
Always: the same tag's values will be encoded (the string elements in the sample above). All other tags and values will already be decoded.
Since there is no uncertainty about which nodes arrive encoded and which do not, there is no need to detect Base64 encoding, and the way to do it is pretty simple:
1. Parse it. There are two common ways to do that in Groovy: XmlSlurper and XmlParser. They differ in how they compute and in memory consumption, but both give you an object/structure representation in the end.
2. Work with that object structure: traverse all required elements and decode the content/attributes you need to decode.
3. Either proceed further with the decoded data and/or serialize it back to XML text.
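The three steps above (parse, traverse and decode, serialize) can be sketched as follows. This is written in Python with xml.etree.ElementTree rather than Groovy, purely to illustrate the flow; the function name decode_strings is made up, and treating every string element as Base64-encoded UTF-8 text is an assumption taken from the sample data.

```python
import base64
import xml.etree.ElementTree as ET

xml_text = """<order>
<name>xyz</name>
<city>abc</city>
<string>aGVsbG8gd29ybGQgMQ==</string>
<string>aGVsbG8gd29ybGQgMg==</string>
<string>aGVsbG8gd29ybGQgMw==</string>
</order>"""


def decode_strings(xml_text: str, tag: str = "string") -> str:
    # 1. Parse the text into an element tree.
    root = ET.fromstring(xml_text)
    # 2. Visit only the elements known to be encoded and decode their text.
    for node in root.iter(tag):
        node.text = base64.b64decode(node.text).decode("utf-8")
    # 3. Serialize the modified tree back to XML text.
    return ET.tostring(root, encoding="unicode")


print(decode_strings(xml_text))
```

In Groovy the same shape applies: parse with XmlSlurper or XmlParser, loop over the string nodes, and serialize the result.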
Articles to look at:
Load, modify, and write an XML document in Groovy
https://www.baeldung.com/groovy-xml
https://groovy-lang.org/processing-xml.html
and many, many more.
Another cheat sheet always useful for Groovy noobs: http://groovy-lang.org/groovy-dev-kit.html
Check out how to traverse the structures there, for instance.

Decode file names from MIME messages with LotusScript

I am parsing a MIME mail with LotusScript to get all attachments. But I get issues when it comes to encoded file names in the header. I got one file with the name
"HE336 =?Windows-1251?Q?=CF=E0=EA=E5=F2_=E4=EE=EA=F3=EC=E5=ED=F2=EE=E2.pdf?="
Is there any way to decode it with LotusScript?
The string I get uses RFC 2047 header encoding. I found that Notes supports it in MIME headers. The issue I had is that MIMEHeader.GetParamVal always returns the encoded value. However, MIMEHeader.GetHeaderVal and GetHeaderValAndParams have an extra parameter:
boolean decoded
true decodes any RFC-2047 encodings
false (default) retains any encodings; false is enforced if folded is true
When this is set to true, I get a decoded value.
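For anyone who wants to see what that decoding step does to the file name from the question, here is a rough Python sketch using the standard email.header module. It is not LotusScript and says nothing about how Notes implements the decoded flag; the helper name decode_rfc2047 and the ASCII fallback for unencoded parts are assumptions made for the example.

```python
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Decode RFC 2047 encoded-words (=?charset?Q?...?= or =?charset?B?...?=)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unencoded parts come back without a charset; assume ASCII for them.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


name = decode_rfc2047(
    "HE336 =?Windows-1251?Q?=CF=E0=EA=E5=F2_=E4=EE=EA=F3=EC=E5=ED=F2=EE=E2.pdf?="
)
print(name)
```

The ?Q? marker means the payload is quoted-printable style ("Q" encoding), where =CF is the byte 0xCF in Windows-1251 and an underscore stands for a space.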
It's been a while, but I once used Julian Robichaux's Base64 classes with Java and/or LS. You should be able to achieve what you are looking for with these.
Base64Encoding
Hope that helps.
Best wishes - Michael

String data flow with encoding and decoding from browser to mysql database and vice versa

A user enters a string in an HTML form input in the browser. This string is saved in the database. How is this string encoded and decoded at each stage, based on character encoding?
Flow as per the technology stack used: browser --> ajax post --> spring mvc --> hibernate --> mysql db
You can expect the browser post to be URL-encoded UTF-8. Within the JVM, the string uses UTF-16, roughly doubling the size taken if it is English text. Hibernate is part of that and does not really care about the encoding, although it does pass along the connection string described next (the hibernate.connection.url property).
The UTF-16 string is then translated by the JDBC driver which, in the case of MySQL, uses the characterEncoding property inside the connection string. It helps if this matches the encoding of the database declared in the CREATE DATABASE statement, avoiding another re-encoding.
Finally, "latin" is not a name of a specific character set or encoding. You probably mean ISO 8859-1, also known as Latin-1. This is not a good choice for a web server as it will not be able to represent most non-English strings. You should use UTF-8 in the database and in the connection string, ending up with UTF-8 -> UTF-16 -> UTF-8 which is a safe and reasonably efficient sequence (not counting any encoding that might be happening in the browser itself).
If you decide to alter the database to use UTF-8, be careful about changing the encoding at table level, too. Each table may use its own encoding and it does not change automatically.
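The size and safety claims above can be checked with a few lines of Python. This is only an illustration of the encodings themselves, not of Spring, Hibernate, or the MySQL driver; the function names are made up, and "utf-16-le" is used so that no BOM is counted in the size.

```python
def byte_sizes(text: str) -> dict:
    """Size in bytes of the same text in the two encodings discussed above."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # "-le" avoids counting a BOM
    }


def survives_latin1(text: str) -> bool:
    """True if every character can be represented in ISO 8859-1 (Latin-1)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def round_trip(text: str) -> str:
    """UTF-8 bytes -> string -> UTF-16 bytes -> string -> UTF-8 bytes."""
    posted = text.encode("utf-8")             # what the browser sends
    in_app = posted.decode("utf-8")           # string inside the application
    as_utf16 = in_app.encode("utf-16-le")     # UTF-16 representation
    stored = as_utf16.decode("utf-16-le").encode("utf-8")  # what the database keeps
    return stored.decode("utf-8")


print(byte_sizes("hello"))            # {'utf-8': 5, 'utf-16': 10}
print(survives_latin1("乂"))           # False
print(round_trip("PSYCHØ乂JOKER"))     # PSYCHØ乂JOKER
```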

What is the smallest URL friendly encoding?

I am looking for a URL encoding method that is most efficient in terms of space. Raw binary (base2) could be represented in base16, which is smaller and is URL-safe, but base64 is even more efficient. However, the usual base64 encoding isn't URL-safe....
So what is the smallest encoding method that is also safe for URLs?
This is what the Base64 URL encoding variant is for.
It uses the same standard Base64 Alphabet except that + is changed to - and / is changed to _.
Most modern Base64 implementations will support this alternate encoding. If yours doesn't, it's usually just a matter of doing a search/replace on the Base64 input prior to decoding, or on the output prior to sending it to a browser.
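As an illustration, Python's standard base64 module ships this variant as urlsafe_b64encode and urlsafe_b64decode. The helper names below are made up, and stripping the trailing = padding (and restoring it before decoding) is an extra assumption of this sketch, not part of the answer above.

```python
import base64


def b64url_encode(data: bytes) -> str:
    """URL-safe Base64 ('-' and '_' instead of '+' and '/'), padding stripped."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Inverse of b64url_encode; restores the padding before decoding."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


raw = bytes([0xFB, 0xFF, 0xBF])
standard = base64.b64encode(raw).decode("ascii")
url_safe = b64url_encode(raw)
print(standard)  # +/+/
print(url_safe)  # -_-_

# The search/replace approach gives the same result.
print(standard.replace("+", "-").replace("/", "_") == url_safe)  # True
```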
You can use a 62 character representation instead of the usual base 64. This will give you URLs like the youtube ones:
http://www.youtube.com/watch?v=0JD55e5h5JM
You can use the PHP functions provided in this page if you need to map strings to a database numerical ID:
http://bsd-noobz.com/blog/how-to-create-url-shortening-service-using-simple-php
Or this one if you need to directly convert a numerical ID to a short URL string:
http://kevin.vanzonneveld.net/techblog/article/create_short_ids_with_php_like_youtube_or_tinyurl/
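A minimal sketch of the base 62 idea for numerical IDs, in Python rather than PHP. It is not taken from the linked pages, and the alphabet order (digits, then uppercase, then lowercase) is an arbitrary choice made for this example.

```python
import string

# 62 characters; the ordering is an assumption of this sketch.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def base62_encode(number: int) -> str:
    """Turn a non-negative integer ID into a short base 62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(text: str) -> int:
    """Inverse of base62_encode."""
    number = 0
    for char in text:
        number = number * 62 + ALPHABET.index(char)
    return number


print(base62_encode(62))    # 10
print(base62_decode("10"))  # 62
```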
"base66" (theoretical, according to spec)
As far as I can tell, the optimal encoding for URLs is a "base66" encoding into the following alphabet:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789-_.~
These are all the "Unreserved characters" according to the URI specification RFC 3986 (section 2.3), so they will appear as-is in the URL. Using this "base66" encoding could give a URL like:
https://example.org/articles/.3Ja~jkWe
The question is then if you want . and ~ in your URLs?
On some older servers (ancient by now, I guess) ~joe would mean the "www directory" of the user joe on this server. And thus a user might be confused as to what the ~ character is doing in the middle of your URL.
This is common for academic websites, especially CS professors (e.g. Donald Knuth's website https://www-cs-faculty.stanford.edu/~knuth/)
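Here is one way such a "base66" encoder could look in Python. It is a sketch, not a standard: the names encode_bytes and decode_bytes, the big-integer conversion, and the 0x01 sentinel byte used to preserve leading zero bytes are all assumptions of this example.

```python
import math
import string

# The 66 unreserved characters from RFC 3986 section 2.3.
BASE66 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_.~"


def encode_bytes(data: bytes, alphabet: str = BASE66) -> str:
    """Encode bytes as one big number written in base len(alphabet)."""
    # Prefix a 0x01 sentinel so leading zero bytes are not lost.
    number = int.from_bytes(b"\x01" + data, "big")
    base = len(alphabet)
    digits = []
    while number:
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


def decode_bytes(text: str, alphabet: str = BASE66) -> bytes:
    """Inverse of encode_bytes."""
    number = 0
    for char in text:
        number = number * len(alphabet) + alphabet.index(char)
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the 0x01 sentinel


# Bits carried per character: base64 carries exactly 6, base66 slightly more.
print(round(math.log2(66), 3))  # 6.044
```

The gain over Base64 is small (about 6.04 bits per character instead of 6), and the alphabet is a parameter, so the same functions work with any other set of distinct characters.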
"base80" (in practice, but not battle-tested)
However, in my own testing the following 14 other symbols also do not get percent-encoded (in Chrome 95 and Firefox 93):
!$'()*+,:;=#[]
(see also this StackOverflow answer)
leaving a "base80" URL encoding possible. Some of these (notably + and =) would not work in the query string part of the URL, only in the path part. All in all, this ends up giving you beautiful, hyper-compressed URLs like:
https://example.org/articles/1OWG,HmpkySCbBy#RG6_,
https://example.org/articles/21Cq-b6Ud)txMEW$,hc4K
https://example.org/articles/:3Tx**U9X'd;tl~rR]q+
There's a plethora of reasons why you might not want all of those symbols in your URLs. One example is that StackOverflow's own "linkifier" won't include that ending comma in the link it generates (I've manually made it a part of the link here).
Also the percent encoding seems to be quite finicky. In some cases Firefox would initially percent-encode ' and ~] but on later requests would not.
