Decoding legacy binary format - string

I am trying to figure out how to decode a "legacy" binary file produced by a Windows application (circa 1990). Specifically, I am having trouble understanding which encoding is used for the strings being stored.
Example: the Unicode string "Düsseldorf" is represented as "Du\06sseldorf", or in hex "44 75 06 73 73 65 6C 64 6F 72 66", where everything is a single byte except "u + \06", which mysteriously becomes a u-umlaut.
Is it completely proprietary? Any ideas?

Since this app pre-dates DBCS and Unicode, I suspect that it is proprietary. It looks like they might be using the control-character byte values below 32 to represent the various accent marks.
\06 may indicate "put an umlaut on the previous character".
Try replacing the string with "Du\05sseldorf" and see if the accent changes over the u. Then try other escaped values between 1 and 31, and I suspect you may be able to come up with a map for these escape characters. Of course, once you have the map, you could easily create a routine to replace all of the strings with proper modern Unicode strings with the accents in place.
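Once such a map exists, the conversion routine is short. A minimal Python sketch, where the ACCENT_MAP contents (beyond \x06, which the example string suggests is an umlaut) are purely hypothetical and would need to be established by experimenting with the application:

```python
import unicodedata

# Hypothetical map from legacy escape bytes to Unicode combining marks.
# Only \x06 (umlaut) is suggested by the example; the rest is a guess.
ACCENT_MAP = {
    0x06: '\u0308',  # combining diaeresis (umlaut)
    0x01: '\u0301',  # combining acute accent (pure guess)
}

def decode_legacy(raw: bytes) -> str:
    out = []
    for b in raw:
        if b in ACCENT_MAP:
            # The legacy escape follows its base character, just like a
            # Unicode combining mark does, so direct substitution works.
            out.append(ACCENT_MAP[b])
        else:
            out.append(chr(b))
    # Fold base char + combining mark into precomposed form: u + U+0308 -> ü
    return unicodedata.normalize('NFC', ''.join(out))

print(decode_legacy(b'Du\x06sseldorf'))  # -> Düsseldorf
```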

Does reading a binary file linewise in python cause problems for unicode data?

I'm reading a large (10Gb) bzipped file in python3, which is utf-8-encoded JSON. I only want a few of the lines though, that start with a certain set of bytes, so to save having to decode all the lines into unicode, I'm reading the file in 'rb' mode, like this:
with bz2.open(filename, 'rb') as file:
    for line in file:
        if line.startswith(b'Hello'):
            # decode line here, then do stuff
            ...
But I suddenly thought: what if one of the Unicode characters contains the same byte as a newline character? By doing for line in file, will I risk getting truncated lines? Or does the line-wise iterator over a binary file still work by magic?
Line-wise iteration will work for UTF-8 encoded data.
Not by magic, but by design:
UTF-8 was created to be backwards-compatible to ASCII.
ASCII only uses the byte values 0 through 127, leaving the upper half of possible values for extensions of any kind.
UTF-8 takes advantage of this, in that any Unicode codepoint outside ASCII is encoded using bytes in the range 128..255.
For example, the letter "Ċ" (capital letter C with dot above) has the Unicode codepoint value U+010A.
In UTF-8, this is encoded with the byte sequence C4 8A, thus without using the byte 0A, which is the ASCII newline.
In contrast, UTF-16 encodes the same character as 0A 01 or 01 0A (depending on the endianness).
So I guess UTF-16 is not safe to do line-wise iteration over.
It's not that common as a file encoding, though.
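The design property described above can be checked directly in Python:

```python
# Every lead or continuation byte of a multi-byte UTF-8 sequence has its
# high bit set (values 0x80..0xFF), so the byte 0x0A can only ever be a
# real newline in UTF-8 data.
assert 'Ċ'.encode('utf-8') == b'\xc4\x8a'   # no 0x0A byte involved
assert b'\n' not in 'Ċ'.encode('utf-8')

# UTF-16 gives no such guarantee: U+010A in little-endian order is 0A 01
assert 'Ċ'.encode('utf-16-le') == b'\x0a\x01'
assert b'\n' in 'Ċ'.encode('utf-16-le')
```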

Put space every two characters in text string

Given the following string:
6e000000b0040000044250534bb4f6fd02d6dc5bc0790c2fde3166a14146009c8684a4624
which is a representation of a byte array; every two characters represent one byte.
I would like to put a space between each Byte using Sublime Text, something like:
6e 00 00 00 b0 04 00 00 04 42 50
Can Sublime Text help me with that?
As a bonus, I would like to split it into lines and add 0x before each byte.
I found a similar question, but it's not related to Sublime Text: Split character string multiple times every two characters.
Go to Find->Replace... and enable regular expressions.
Find: (.{2})
Replace: $1SPACE
where SPACE is a literal space character.
To split it onto separate lines and add 0x before each byte:
Find: (.{2})
Replace: 0x$1\n
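Outside Sublime Text, the same two substitutions work in any regex engine with similar syntax; a Python sketch:

```python
import re

s = '6e000000b0040000044250534bb4f6fd02d6dc5bc0790c2fde3166a14146009c8684a4624'

# Insert a space after every pair of characters, drop the trailing space.
spaced = re.sub(r'(.{2})', r'\1 ', s).rstrip()
print(spaced[:32])   # 6e 00 00 00 b0 04 00 00 04 42 50

# One byte per line, each prefixed with 0x.
prefixed = re.sub(r'(.{2})', r'0x\1\n', s).rstrip()
print(prefixed.splitlines()[:3])  # ['0x6e', '0x00', '0x00']
```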

Remove BOM from string with Perl

I have the following problem: I am reading from a UTF-8 text file (and I am telling Perl that I am doing so by ":encoding(utf-8)").
The file looks like this in a hex viewer:
EF BB BF 43 6F 6E 66 65 72 65 6E 63 65
This translates to "Conference" when printed. I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it (not because of the warning, but because it messes up a string comparison that I undertake later).
So I tried to remove it using the following code, but I fail miserably:
$line =~ s/^\xEF\xBB\xBF//;
Can anyone enlighten me as to how to remove the UTF-8 BOM from a string which I obtained by reading the first line of the UTF-8 file?
Thanks!
EF BB BF is the UTF-8 encoding of the BOM, but you decoded it, so you must look for its decoded form. The BOM is a ZERO WIDTH NO-BREAK SPACE (U+FEFF) used at the start of a file, so any of the following will do:
s/^\x{FEFF}//;
s/^\N{U+FEFF}//;
s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
s/^\N{BOM}//; # Convenient alias
See also: File::BOM.
I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it
You're getting the "wide character" warning because you forgot to add an :encoding layer to your output file handle. The following adds :encoding(UTF-8) to STDIN, STDOUT, and STDERR, and makes it the default for open():
use open ':std', ':encoding(UTF-8)';
To defuse the BOM, you have to know that after decoding it is not three bytes but a single character, U+FEFF:
s/^\x{FEFF}//;
If you open the file using File::BOM, it will remove the BOM for you.
use File::BOM;
open_bom(my $fh, $path, ':utf8');
Ideally, your filehandle should be doing this for you automatically. But if you're not in an ideal situation, this worked for me:
use Encode;
my $value = decode('UTF-8', $originalvalue);
$value =~ s/\N{U+FEFF}//;
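For comparison, the same decoded-BOM logic in Python, where the utf-8-sig codec exists precisely for this purpose:

```python
# EF BB BF at the start of a UTF-8 file decodes to the single character
# U+FEFF; stripping it after decoding mirrors the Perl substitution above.
raw = bytes.fromhex('EFBBBF436F6E666572656E6365')
text = raw.decode('utf-8')
assert text[0] == '\ufeff'                 # the decoded BOM is one character
assert text.lstrip('\ufeff') == 'Conference'

# Or let the codec do it: utf-8-sig strips a leading BOM automatically.
assert raw.decode('utf-8-sig') == 'Conference'
```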

What is std::wifstream::getline doing to my wchar_t array? It's treated like a byte array after getline returns

I want to read lines of Unicode text (UTF-16 LE, line feed delimited) from a file. I'm using Visual Studio 2012 and targeting a 32-bit console application.
I was not able to find a ReadLine function within WinAPI so I turned to Google. It is clear I am not the first to seek such a function. The most commonly recommended solution involves using std::wifstream.
I wrote code similar to the following:
wchar_t buffer[1024];
std::wifstream input(L"input.txt");
while (input.good())
{
    input.getline(buffer, 1024);
    // ... do stuff ...
}
input.close();
For the sake of explanation, assume that input.txt contains two UTF-16 LE lines which are less than 200 wchar_t chars in length.
Prior to calling getline the first time, Visual Studio correctly identifies that buffer is an array of wchar_t. You can mouse over the variable in the debugger and see that the array is comprised of 16-bit values. However, after the call to getline returns, the debugger now displays buffer as if it were a byte array.
After the first call to getline, the contents of buffer are correct (aside from buffer being treated like a byte array). If the first line of input.txt contains the UTF-16 string L"123", this is correctly stored in buffer as (hex) "31 00 32 00 33 00"
My first thought was to reinterpret_cast<wchar_t *>(buffer) which does produce the desired result (buffer is now treated like a wchar_t array) and it contains the values I expect.
However, after the second call to getline, (the second line of input.txt contains the string L"456") buffer contains (hex) "00 34 00 35 00 36 00". Note that this is incorrect (it should be [hex] 34 00 35 00 36 00)
The fact that the byte ordering gets messed up prevents me from using reinterpret_cast as a workaround. More importantly, why is std::wifstream::getline converting my wchar_t buffer into a char buffer at all? I was under the impression that if one wanted to use chars one would use ifstream, and for wchar_t one would use wifstream...
I am terrible at making sense of the standard library headers, but it almost looks as if wifstream is intentionally converting my wchar_t to char... why?
I would appreciate any insights and explanations for understanding these problems.
wifstream reads bytes from the file, and converts them to wide chars using codecvt facet installed into the stream's locale. The default facet assumes system-default code page and calls mbstowcs on those bytes.
To treat your file as UTF-16, you need to use codecvt_utf16 (from the <codecvt> header). Like this:
#include <fstream>
#include <locale>
#include <codecvt>

std::wifstream fin("text.txt", std::ios::binary);
// apply facet
fin.imbue(std::locale(fin.getloc(),
    new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
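The underlying issue is language-independent: the byte stream must be decoded as UTF-16 before line splitting happens. As an illustration of the same fix, a Python analogue that wraps the raw bytes in a decoding text layer:

```python
import io

# UTF-16 LE bytes for two lines; each newline is the byte pair 0A 00.
raw = '123\n456\n'.encode('utf-16-le')

# Decode while reading, so line splitting operates on characters, not bytes.
text = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-16-le')
lines = text.read().splitlines()
assert lines == ['123', '456']
```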

What's the best way to diagnose character encoding problems in Vim?

This happens often if I open a plain text file in Vim. I see normal character text, but then � characters here and there, usually where there should just be a space. If I type :set encoding I see encoding=utf-8, and this is correct since I see smart quotes in the text where they should be. What are these � characters and how can I fix how they are displayed?
� (U+FFFD) is the Unicode replacement character. Decoders for the UTF encodings (UTF-8, UTF-16, UTF-32) commonly substitute it for any byte sequence that is illegal in the encoding being used. The other options are discarding the offending bytes or halting the decoding process entirely at the first sign of trouble.
For example, the bytes for hellö in ISO-8859-1:
68 65 6c 6c f6
When decoded as UTF-8, this becomes hell�. The byte 0xF6 can never appear on its own in valid UTF-8, but the other bytes are completely valid and "by accident" even decode to the same characters.
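The example can be reproduced directly in Python:

```python
raw = b'hell\xf6'                     # 'hellö' encoded in ISO-8859-1

assert raw.decode('latin-1') == 'hellö'

# 0xF6 is an illegal standalone byte in UTF-8, so strict decoding fails...
try:
    raw.decode('utf-8')
    raise AssertionError('unreachable')
except UnicodeDecodeError:
    pass

# ...and a tolerant decoder substitutes U+FFFD, displayed as �.
assert raw.decode('utf-8', errors='replace') == 'hell\ufffd'
```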