ANTLR4 C++ Target - UTF-8 conversion error in HTTP parser

I'm trying to use ANTLR4 to build an HTTP parser according to the grammar in RFC 7230. I generated the parser with the ANTLR tool and put it into my code. But when I feed it data from a browser (via a TCP server), I get an exception: UTF-8 string contains an illegal byte sequence.
If I understood the RFC correctly, I don't need any conversion to UTF-8, because the header data are US-ASCII and the body is simply a set of bytes.
Can I disable the conversion of data to UTF-8 in ANTLR?
Thanks for any advice.

Related

Why do we need metadata specifying the encoding?

I see a bit of a chicken-and-egg problem if I write an HTML meta tag specifying the charset as, say, UTF-16: how do we decode the entire HTTP request in the first place if we didn't know it was UTF-16 data? I believe the request headers need to handle this, and by the time we try to read metadata like the HTML tag charset="utf-16" we already know it's UTF-16.
Besides, think one level higher about header information like the request headers themselves: are they passed in ASCII as a standard?
I mean, at some level we need an agreed convention, and you can't store the information needed to decode the data as metadata inside that same data. Can anyone clarify this?
I am a bit confused by the idea of specifying the information needed to interpret the whole data as metadata inside the original data.
In general, how can any form of encoding work if we don't have a standard, agreed-upon language/encoding to convey the data about the data itself?
For example, I am informed that Apache's default is ISO 8859-1. So would all clients need to assume that for HTTP headers, and interpret the real content as UTF-8 if we want UTF-8 for the content type?
"What character encoding should I use for a HTTP header?" is a closely related question.
UTF-16 (and other) encodings use a BOM (Byte Order Mark) that is read at the start of the file and signals which encoding is being used. Only after that does the encoded part of the file begin.
For example, for UTF-16, you'll have the bytes FE FF if big-endian and FF FE if little-endian words are being used.
You also often see UTF-8 BOMs, although they don't need to be used (and may confuse some XML parsers).
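As a quick illustration of that check, here is a minimal Java sketch (the file name input.bin is just a placeholder) that sniffs the leading bytes for the BOMs described above:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomSniffer {
    public static void main(String[] args) throws IOException {
        // Read just the first few bytes of whatever raw data was received.
        try (InputStream in = Files.newInputStream(Paths.get("input.bin"))) {
            int b1 = in.read();
            int b2 = in.read();
            if (b1 == 0xFE && b2 == 0xFF) {
                System.out.println("UTF-16 BOM, big-endian (FE FF)");
            } else if (b1 == 0xFF && b2 == 0xFE) {
                System.out.println("UTF-16 BOM, little-endian (FF FE)");
            } else if (b1 == 0xEF && b2 == 0xBB && in.read() == 0xBF) {
                System.out.println("UTF-8 BOM (EF BB BF) - optional, and best omitted");
            } else {
                System.out.println("No BOM found; fall back to headers or a default encoding");
            }
        }
    }
}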

Tornado decoding POST argument fails with UnicodeDecodeError

Someone has been using my Tornado application and making POST requests which contain this character: ¡
Tornado was unable to decode the value and ended up with this error: HTTP 400: Bad Request (Invalid unicode in PARAMNAME: b'DATAHERE')
So I did some investigation and learned that in the request body I was receiving %A1 for the corresponding character, which Python's decode method had no difficulty decoding as UTF-8.
But after URL-decoding this value, Tornado ended up with the byte \xa1 for the character, tried to decode it as UTF-8, and failed, because it was actually ISO-8859-1 encoded.
So what would be the appropriate way to fix this? Because the user is sending valid input, I don't want to lose this data.
The best answer is to make sure the client always sends UTF-8 instead of ISO-8859-1 (this used to require weird tricks like the Rails snowman; I'm not sure about the current state of the art). If you cannot do that, override RequestHandler.decode_argument (http://www.tornadoweb.org/en/stable/web.html#tornado.web.RequestHandler.decode_argument), which can see the raw bytes and decide how to decode them (or pass them through unchanged if you don't want to decode at this point).
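The decode_argument override itself is Python/Tornado-specific, but the fallback strategy it enables is language-agnostic. Here is a rough sketch of "try strict UTF-8, then fall back to ISO-8859-1" in Java, with a made-up helper name; it only mirrors the idea, it is not Tornado's actual API:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class ArgumentDecoder {
    // Hypothetical helper mirroring the "UTF-8 first, Latin-1 as a last resort" strategy.
    static String decodeArgument(byte[] raw) {
        try {
            // A fresh CharsetDecoder reports malformed input instead of silently replacing it.
            return StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Every byte value is a valid ISO-8859-1 code point, so this cannot fail.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xA1};               // the URL-decoded %A1 from the question
        System.out.println(decodeArgument(body));  // prints the inverted exclamation mark
    }
}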

How to read a UTF-8 encoded list of string tokens into a vector?

I have a UTF-8 encoded text file with one token per line. I would like to read it into a vector. This is R 3.0.1 on MS Windows. I understand that the default encoding is UTF-8, right?
I am looking for a code snippet like the ones at http://www.mayin.org/ajayshah/KB/R/html/r4.html from 'R by example' (http://www.mayin.org/ajayshah/KB/R/index.html), but they do not have a UTF-8 example, only ASCII.
You can either read it in with read.table() and then extract the column as a vector, or with scan().
vect <- scan(file="path/to/file1.txt", what=character(0) )
You would not need to use UTF-8 as the encoding, since you know that it is the default, but there is the option of doing so:
vect <- scan(file="path/to/file1.txt", what=character(0), encoding="UTF-8" )
The NEWS file for R 3.0.0 said:
" o readLines() and scan() (and hence read.table()) in a UTF-8 locale now discard a UTF-8 byte-order-mark (BOM). Such BOMs are allowed but not recommended by the Unicode Standard: however Microsoft applications can produce them and so they are sometimes found on websites.
The encoding name "UTF-8-BOM" for a connection will ensure that a UTF-8 BOM is discarded. "
So perhaps the need for the encoding argument indicates either that you were in a non-UTF-8 locale and didn't tell us, or that you were using an outdated R version?

String data flow with encoding and decoding from browser to MySQL database and vice versa

A user enters a string into an HTML form input in the browser. This string is saved in the database. How is the string encoded and decoded at each stage, based on character encoding?
Flow as per the technology stack used: browser --> AJAX POST --> Spring MVC --> Hibernate --> MySQL DB
You can expect the browser POST to be URL-encoded UTF-8. Within the Java JVM, the string uses UTF-16, roughly doubling the size taken if it is English text. Hibernate is part of that picture and does not really care about the encoding, although it does pass the connection string along as described next (the hibernate.connection.url property).
The UTF-16 string is then translated by the JDBC driver which, in the case of MySQL, will use the characterEncoding property inside the connection string. It helps if this matches the encoding of the database declared in the CREATE DATABASE statement, avoiding another re-encoding.
Finally, "latin" is not the name of a specific character set or encoding. You probably mean ISO 8859-1, also known as Latin-1. That is not a good choice for a web server, as it cannot represent most non-English strings. You should use UTF-8 in the database and in the connection string, ending up with UTF-8 -> UTF-16 -> UTF-8, which is a safe and reasonably efficient sequence (not counting any encoding that might be happening in the browser itself).
If you decide to alter the database to use UTF-8, be careful to change the encoding at the table level too. Each table may use its own encoding, and it does not change automatically.
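As an illustration of the connection-string part, a minimal JDBC setup might look like the sketch below; the host, database name, and credentials are placeholders, and characterEncoding is the Connector/J property referred to above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class Utf8ConnectionExample {
    public static void main(String[] args) throws SQLException {
        // characterEncoding tells the MySQL driver how to convert between the JVM's
        // UTF-16 strings and the bytes sent over the wire to the database.
        String url = "jdbc:mysql://localhost:3306/mydb"
                + "?useUnicode=true&characterEncoding=UTF-8";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("Connected with UTF-8 character encoding");
        }
        // With Hibernate, the same URL goes into the hibernate.connection.url property;
        // the database itself should also be created with a UTF-8 character set,
        // e.g. CREATE DATABASE mydb CHARACTER SET utf8mb4.
    }
}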

Unmarshalling with JAXB leads to: javax.xml.bind.UnmarshalException (invalid byte sequence)

Here's my problem: I've written a program that unmarshals an XML file given as input. It works just fine in my development environment, BUT the same program yields the following exception in my client's environment:
javax.xml.bind.UnmarshalException
- with linked exception:
[java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.]
The XML file given as input to my program uses UTF-8 as its encoding. The Unmarshaller object uses the default encoding, that is, UTF-8, since I did not set any property on it. Besides, I did not set a schema on the unmarshaller, so I am not even requesting XML validation.
Does anyone have any idea, or has anyone already run into the same problem?
Thanks in advance
I have already gotten this error. I changed my configuration to use ISO-8859-1 encoding:
marshaller.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
I can put UTF-8 strings in the XML flow; they are correctly marshalled/unmarshalled even if the encoding is not declared as ISO-8859-1.
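For completeness, here is a hedged sketch of the unmarshalling side. This kind of invalid-byte error usually means the bytes in the file do not match the encoding the parser assumed, so a common fix is to make sure the XML prolog declares the file's real encoding and to hand the parser the raw bytes rather than a Reader built with a guessed charset. The Message class and the file name below are placeholders, not anything from the original program:

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.transform.stream.StreamSource;

public class UnmarshalExample {

    // Hypothetical root element standing in for whatever the real schema defines.
    @XmlRootElement(name = "message")
    public static class Message {
        public String text;
    }

    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Message.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();

        // A byte stream lets the parser pick the encoding from the <?xml ... ?> declaration,
        // so a file that is really ISO-8859-1 and declares itself as such is decoded correctly.
        try (InputStream in = new FileInputStream("input.xml")) {
            Message msg = (Message) unmarshaller.unmarshal(new StreamSource(in));
            System.out.println(msg.text);
        }
    }
}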
