How to detect encoding errors in a Node.js Buffer

I'm reading a file in Node.js into a Buffer object, and I'm decoding the UTF-8 content of the Buffer using Buffer.toString('utf8'). If there are encoding errors, I want to report a failure.
The toString() method handles decoding errors by substituting a U+FFFD replacement character, which I can detect by searching the result. But U+FFFD is a legal character in the input file, and I don't want to report an error if the U+FFFD was present and correctly encoded in the input.
Is there any way I can distinguish a Buffer that contains a legitimately encoded U+FFFD character from one that contains an encoding error?

The solution proposed by @eol in a comment on the question appears to meet the requirements.
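For reference, one strict-decoding approach that meets these requirements (it may or may not be what that comment suggested) is Node's WHATWG TextDecoder, which, when constructed with fatal: true, throws on malformed UTF-8 instead of substituting U+FFFD. A minimal sketch:

function isValidUtf8(buf) {
    try {
        // fatal: true makes decode() throw a TypeError on malformed input
        new TextDecoder('utf-8', { fatal: true }).decode(buf);
        return true;
    } catch (e) {
        return false;
    }
}

A legitimately encoded U+FFFD (the byte sequence EF BF BD) decodes cleanly, so it is not reported as an error.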

Related

EncodingGroovyMethods and decodeBase64

I have run into an issue on Windows where an encoded file is read and decoded using EncodingGroovyMethods#decodeBase64:
getClass().getResourceAsStream('/endoded_file').text.decodeBase64()
This gives me:
bad character in base64 value
The file itself has CRLF line endings, and the Groovy decodeBase64 implementation contains this snippet:
} else if (sixBit == 66) {
    // RFC 2045 says that I'm allowed to take the presence of
    // these characters as evidence of data corruption
    // So I will
    throw new RuntimeException("bad character in base64 value"); // TODO: change this exception type
}
I looked up RFC 2045, and a CRLF pair is supposed to be legal. I have tried the same with org.apache.commons.codec.binary.Base64#decodeBase64 and it works. Is this a bug in Groovy, or was it intentional?
I am using Groovy 2.4.7.
This is not a bug, but a different way of handling corrupt data. Looking at the source code of Base64 in Apache Commons, you can see the documentation:
* Ignores all non-base64 characters. This is how chunked (e.g. 76 character) data is handled, since CR and LF are
* silently ignored, but has implications for other bytes, too. This method subscribes to the garbage-in,
* garbage-out philosophy: it will not check the provided data for validity.
So, while the Apache Base64 decoder silently ignores the corrupt data, the Groovy one will complain about it. The RFC documentation is a bit fuzzy about it:
In base64 data, characters other than those in Table 1, line breaks, and other
white space probably indicate a transmission error, about which a
warning message or even a message rejection might be appropriate
under some circumstances.
While warning messages are hardly useful (who checks for warnings anyway?), the Groovy authors decided to go down the path of 'message rejection'.
TL;DR: both are fine, just different ways of handling corrupt data. If you can, try to fix or reject the incorrect data.
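For what it's worth, Node.js's built-in Base64 decoder falls on the lenient side of the same divide: the Buffer documentation states that whitespace (including CR and LF) contained in a base64-encoded string is ignored, so CRLF-wrapped data decodes fine there too:

// Node.js: CRLF inside base64 input is silently skipped
const wrapped = 'SGVsbG8s\r\nIHdvcmxkIQ==';
console.log(Buffer.from(wrapped, 'base64').toString('utf8')); // Hello, world!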

Converting bits to string (data)

I have a file which contains some data (text copied and pasted from the "What You Will Learn" portion of this PDF). Firstly, I converted the contents of the file to bits successfully. However, when I try to convert it back to the original format, some of the characters are not correctly converted, as shown below:
Cisco has
developed the Cisco Open Network Environment (ONE)
architecture as a multifaceted approach to network
programmability delivered across three pillars:
??)É¥ Í?н??ÁÁ±¥?Ñ¥½¸ÁɽÉ?µµ¥¹?¥¹Ñ?É???Ì?¡A%̤?)?áÁ½Í??¥É?ѱ佸Íݥѡ?Ì?¹É½ÕÑ?ÉÌѼ?Õµ?¹Ð?)?á¥ÍÑ¥¹?=Á?¹±½ÜÍÁ?¥?¥?Ñ¥½¹Ì* ¤&öGV7F?öâ×&VG?÷VäfÆ÷r6öçG&öÆÆW"æB÷VäfÆ÷r ¦vVçG0¨?HÝZ]HÙ??ÙXÝÈÈ[]?\??\X[Ý?\?^\Ë?\X[?Ù\?XÙ\Ë[??\ÛÝ\?ÙHÜ?Ú\Ý?][Û?Ø\X?[]Y\È[?H?]HÙ[
As you can see, some characters are converted successfully while others are not.
My code is below:
file = open("test.txt",'r')
myfile = ''.join(map(str,file))
l = []
for i in myfile:
    asc11 = ord(i)
    b = "{0:08b}".format(asc11)
    l.extend(int(y) for y in b)
string_bin = ''.join(map(str,l))
mydata = ''.join(chr(int(string_bin[i:i+8], 2)) for i in range(0,len(string_bin), 8))
print(mydata)
What's wrong with my code? What do I need to change to make it work properly?
What's Going On?
You are running into an encoding issue because some characters in the PDF are non-ASCII. For example, the bullet points are U+2022, which requires 3 bytes of storage in UTF-8.
When Python reads from your file, it doesn't know what encoding you used to write that data. Thus it reads bytes from the file and uses a character encoding to translate them into strs, which are stored using Python's own internal unicode format. (This differs from Python 2, where open() returned raw bytes stored in a str which you could then manually decode to unicode.)
Thus, in Python 3, open() accepts a named encoding parameter. For example, open("test.txt", 'r', encoding='ascii'). Because you don't specify the encoding when you call open(), you end up using your system's default encoding. For instance, on my laptop, the default encoding is CP1252 (Windows-1252). Yours may differ.
Whatever encoding Python uses to interpret your file, it then internally uses its own unicode format to store your string. This means that your string may contain multi-byte characters even if the original encoding did not. For example, my laptop uses CP1252 to interpret the three UTF-8 bytes of U+2022 (0xE2 0x80 0xA2) as the three characters U+00E2, U+20AC, U+00A2 (â€¢) -- € is stored as a multi-byte character even though it was just one byte in the original file.
Let's assume your computer is sane and uses UTF-8 by default (this explanation is similar for many multi-byte characters). When you reach a bullet point, it is stored as U+2022. When you call ord('\u2022') the result is 8226. When you then call "{0:08b}".format(8226), this returns "10000000100010". That's a 14-character string. Your parsing code assumes all of the ordinals will generate 8-character strings. Because of this, the "binary" output becomes misaligned. This means that when you then parse the binary string in 8-character segments, it gets thrown off and starts interpreting things as control characters and all sorts of foreign-language characters.
If you call open(..., encoding='ascii'), Python will actually throw an exception because the file contains bytes that are not valid ASCII.
Possible Solutions
I'm not sure why exactly you are converting the input string into the representation that you are using. It's not binary, as your question title would suggest. Rather, you've converted the data into a textual representation of its binary encoding.
Technically speaking, when you store encoded text to a file, it's stored using a binary representation. Python, and any text editor, has to decode those bytes into its internal character representation before it can display them as text. Thus, calling open("test.txt", "r", encoding="utf-8") reads the binary data out of your text file and converts it into Python's internal unicode format. Similarly, calling myfile.encode('utf-8') will return the UTF-8 encoded bytes, which can then be written to a file, network socket, etc.
If, however, you do need to use a format similar to what you are currently using, first, I still recommend you specify an encoding when you call open() (I recommend UTF-8). Then you can consider these options:
Detect and omit non-ASCII characters. They will have an ordinal >= 128.
Mimic UTF-16 or UTF-32 and emit fixed-width output for all characters. For example, use "{0:032b}".format(asc11) and then parse the result in 32-character chunks (see the sketch below). It's memory- and storage-inefficient, but it will preserve multi-byte characters.
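Here is a quick sketch of that fixed-width idea (in Node.js here; the logic carries over directly to the Python code above). Padding every code point to 32 bits keeps the chunks aligned, so the round trip survives non-ASCII characters:

// Encode each code point as a fixed 32-bit binary string, then decode
// the result in 32-character chunks; U+2022 no longer breaks alignment.
const text = '\u2022 bullet points survive';
const bits = [...text]
    .map(ch => ch.codePointAt(0).toString(2).padStart(32, '0'))
    .join('');
const decoded = (bits.match(/.{32}/g) || [])
    .map(chunk => String.fromCodePoint(parseInt(chunk, 2)))
    .join('');
console.log(decoded === text); // true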
Regardless, I highly recommend reading the Dive Into Python 3 chapter about strings.

Writing text files with node webkit

I'm trying to do something rather simple: write data entered in a text input field to a text file...
var data = document.getElementById("fileContent").value;
fs.writeFileSync("test.txt", data);
For instance if I type in,
Write this to file 123 123
I end up with this in the file...
Write this to
If I hard code a string into the application, it writes correctly.
fs.writeFileSync("test.txt", "this is a hard coded string");
I tried using writeFileSync with and without the encoding parameter set. I've tried createWriteStream with and without the encoding parameter set. I've tried fileOpen, fs.writeSync, and fs.close. I even tried converting the data to a Buffer object and writing that. In every case, I got the exact same results.
The encoding is also strange. Notepad++ indicates that the encoding is "UCS2-LE w/o BOM". I'd expect it to be UTF-8, as I've been setting the encoding parameter to that.
Any thoughts?
It's a bug with Node-Webkit v0.9.*.
It's OK if you use Node-Webkit v0.8.* or a lower version.
After some more research and determining it was something to do with encoding, I stumbled on this post. Apparently, utf8 doesn't work...
https://groups.google.com/forum/#!msg/node-webkit/3M-0v92o9Zs/eSYnSZ8dUK0J
I changed the encoding to "utf16le", and this appears to write the text correctly for both hard-coded text and text from a text box.
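So, assuming the same writeFileSync call from the question, the workaround amounts to passing the encoding explicitly:

var data = document.getElementById("fileContent").value;
fs.writeFileSync("test.txt", data, "utf16le"); // "utf8" is broken in node-webkit v0.9.*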

Detect partial or incomplete characters read from a buffer

In a loop I am reading a UTF-8 encoded stream, 10 bytes (say) at a time. As the stream is read into a buffer first, I must specify its read length in bytes before transforming it to a UTF-8 string. The issue I am facing is that sometimes the read ends with a partial, incomplete character. I need to fix this.
Is there a way to detect if a string ends with an incomplete character, or some check that I can perform on the last character of my string to determine this?
Preferably, a solution that is not tied to a single encoding would be best.
If a buffer ends with an incomplete character and you convert it into a string and then initialize a new buffer from that string, the new buffer will be a different length (longer if you're using utf8, shorter if you're using ucs2) than the original.
Something like:
var b2 = new Buffer(buf.toString('utf8'), 'utf8');
if (b2.length !== buf.length) {
    // buffer has an incomplete character
} else {
    // buffer is OK
}
Substitute your desired encoding for 'utf8'.
Note that this is dependent on how the current implementation of Buffer#toString deals with incomplete characters, which isn't documented, though it's unlikely to be changed in a way that would result in equal-length buffers (a future implementation might throw an error instead, so you should probably wrap the code in a try-catch block).
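For the UTF-8 case specifically, Node's built-in string_decoder module is designed for exactly this chunked-read scenario: write() returns only complete characters and buffers a trailing partial sequence until the remaining bytes arrive. A minimal sketch:

var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');

// U+2022 is the three bytes E2 80 A2; feed them split across two reads.
var part1 = decoder.write(Buffer.from([0xe2, 0x80])); // '' (incomplete, buffered)
var part2 = decoder.write(Buffer.from([0xa2]));       // '\u2022'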

How to store binary data in a Lua string

I needed to create a custom file format with embedded meta information. Instead of whipping up my own format, I decided to just use Lua.
texture
{
    format=GL_LUMINANCE_ALPHA;
    type=GL_UNSIGNED_BYTE;
    width=256;
    height=128;
    pixels=[[
<binary-data-here>]];
}
texture is a function that takes a table as its sole argument. It then looks up the various parameters by name in the table and forwards the call on to a C++ routine. Nothing out of the ordinary I hope.
Occasionally the files fail to parse with the following error:
my_file.lua:8: unexpected symbol near ']'
What's going on here?
Is there a better way to store binary data in Lua?
Update
It turns out that storing binary data in a Lua string is non-trivial. But it is possible, taking care with three cases.
Long-format-string-literals cannot have an embedded closing-long-bracket (]], ]=], etc).
This one is pretty obvious.
Long-format-string-literals cannot end with something like ]== which would match the chosen closing-long-bracket.
This one is more subtle. Luckily the script will fail to compile if done wrong.
The data cannot embed \n or \r.
Lua's built-in line-end processing messes these up. This problem is much more subtle: the script will compile fine, but it will yield the wrong data. CR (13) becomes LF (10), CR LF (13 10) also becomes LF (10), etc.
To get around these limitations, I split the binary data on \r and \n, pick a long-bracket level that works for each piece, and finally emit Lua that concatenates the various parts back together. I used a script that does this for me.
input: XXXX\nXX]]XX\r\nXX]]XX]=
texture
{
    --other fields omitted
    pixels= '' ..
        [[XXXX]] ..
        '\n' ..
        [=[XX]]XX]=] ..
        '\r\n' ..
        [==[XX]]XX]=]==];
}
Lua is able to encode most characters in long bracket format including nulls. However, Lua opens the script file in text mode and this causes some problems. On my Windows system the following characters have problems:
Char code(s)     Problem
--------------   -------------------------------
13 (CR)          Is translated to 10 (LF)
13 10 (CR LF)    Is translated to 10 (LF)
26 (EOF)         Causes "unfinished long string near '<eof>'"
If you are not using Windows, then these may not cause problems, but there may be different text-mode-based problems.
I was only able to produce the error you received when the encoded data contained multiple closing brackets:
a=[[
]]] --> a.lua:2: unexpected symbol near ']'
But, this was easily fixed with the following:
a=[==[
]]==]
The binary data needs to be encoded into printable characters. The simplest method for decoding purposes would be to use C-like escape sequences for all bytes. For example, hex bytes 13 41 42 1E would be encoded as '\19\65\66\30'. Of course, then the encoded data is three to four times larger than the source binary.
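If you generate the Lua from another program anyway (as with the splitting script above), emitting those escapes is a one-liner; a sketch in Node.js, using the same byte values as the example:

// Turn raw bytes into a Lua string literal of decimal escapes
const bytes = [0x13, 0x41, 0x42, 0x1e];
const luaLiteral = "'" + bytes.map(b => '\\' + b).join('') + "'";
console.log(luaLiteral); // '\19\65\66\30'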
Alternatively, you could use something like Base64, but that would have to be decoded at runtime instead of relying on the Lua interpreter. Personally, I'd probably go the Base64 route. There are Lua examples of Base64 encoding and decoding.
Another alternative would be to have two files: a well-defined image-format file (e.g. TGA) that is pointed to by a separate Lua script containing the additional metadata. If you don't want two files to move around, they could be combined in an archive.
