Remove BOM from string with Perl

I have the following problem: I am reading from a UTF-8 text file (and I am telling Perl that I am doing so by ":encoding(utf-8)").
The file looks like this in a hex viewer:
EF BB BF 43 6F 6E 66 65 72 65 6E 63 65
This translates to "Conference" when printed. I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it (not because of the warning, but because it messes up a string comparison that I undertake later).
So I tried to remove it using the following code, but I fail miserably:
$line =~ s/^\xEF\xBB\xBF//;
Can anyone enlighten me as to how to remove the UTF-8 BOM from a string which I obtained by reading the first line of the UTF-8 file?
Thanks!

EF BB BF is the UTF-8 encoding of the BOM, but you decoded it, so you must look for its decoded form. The BOM is a ZERO WIDTH NO-BREAK SPACE (U+FEFF) used at the start of a file, so any of the following will do:
s/^\x{FEFF}//;
s/^\N{U+FEFF}//;
s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
s/^\N{BOM}//; # Convenient alias
See also: File::BOM.
I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it
You're getting the "wide character" warning because you forgot to add an :encoding layer to your output file handle. The following adds :encoding(UTF-8) to STDIN, STDOUT, and STDERR, and makes it the default for open().
use open ':std', ':encoding(UTF-8)';
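Putting it together, a minimal sketch (conference.txt is just a stand-in for your actual file name):
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # UTF-8 on STDIN/STDOUT/STDERR and as the default for open()

open my $fh, '<', 'conference.txt' or die "Can't open conference.txt: $!";
my $line = <$fh>;
chomp $line;
$line =~ s/^\x{FEFF}//;                # strip the decoded BOM, if present
print "[$line]\n";                     # prints [Conference], with no wide-character warning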

To defuse the BOM, you have to know that after decoding it is not 3 characters but a single one, U+FEFF:
s/^\x{FEFF}//;

If you open the file using File::BOM, it will remove the BOM for you.
use File::BOM;
open_bom(my $fh, $path, ':utf8');

Ideally, your filehandle should be doing this for you automatically. But if you're not in an ideal situation, this worked for me:
use Encode;
my $value = decode('UTF-8', $originalvalue);
$value =~ s/\N{U+FEFF}//;

Related

Unexpected line break even though CR and LF not present in text file

I have a text file that contains ASCII characters as well as the bytes 1C and 1D appended at the end. When I open this file in Windows Notepad or Atom it shows line breaks, which are not intended or expected, even though there are no CR or LF characters.
Any explanation, or a way to keep the bytes in the file without them showing up as line breaks?
[hexdump of file.txt]
Thanks!
It looks to me like it's the length of the file rather than the group/file separators (hex 1C/1D) themselves. Shorter combinations of them lead to fewer displayed 'blank lines'. They also aren't affected by word wrapping the way printable characters are. Notepad++ handles them much better.
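If you want to confirm what the trailing bytes really are (and that no CR or LF is hiding in there), here is a quick raw-mode check in Perl, sketched on the assumption that the file is small enough to slurp:
open my $fh, '<:raw', 'file.txt' or die "Can't open file.txt: $!";
my $data = do { local $/; <$fh> };                         # slurp the whole file as bytes
printf "%02X ", ord($_) for split //, substr($data, -8);   # last 8 bytes in hex
print "\n";                                                 # expect 1C/1D at the end, with no 0D (CR) or 0A (LF)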

What is std::wifstream::getline doing to my wchar_t array? It's treated like a byte array after getline returns

I want to read lines of Unicode text (UTF-16 LE, line feed delimited) from a file. I'm using Visual Studio 2012 and targeting a 32-bit console application.
I was not able to find a ReadLine function within WinAPI so I turned to Google. It is clear I am not the first to seek such a function. The most commonly recommended solution involves using std::wifstream.
I wrote code similar to the following:
#include <fstream>

wchar_t buffer[1024];
std::wifstream input(L"input.txt");
while (input.good())
{
    input.getline(buffer, 1024);
    // ... do stuff...
}
input.close();
For the sake of explanation, assume that input.txt contains two UTF-16 LE lines which are less than 200 wchar_t chars in length.
Prior to calling getline the first time, Visual Studio correctly identifies that buffer is an array of wchar_t. You can mouse over the variable in the debugger and see that the array consists of 16-bit values. However, after the call to getline returns, the debugger displays buffer as if it were a byte array.
After the first call to getline, the contents of buffer are correct (aside from buffer being treated like a byte array). If the first line of input.txt contains the UTF-16 string L"123", this is correctly stored in buffer as (hex) "31 00 32 00 33 00".
My first thought was to reinterpret_cast<wchar_t *>(buffer), which does produce the desired result (buffer is now treated like a wchar_t array) and contains the values I expect.
However, after the second call to getline (the second line of input.txt contains the string L"456"), buffer contains (hex) "00 34 00 35 00 36 00". Note that this is incorrect (it should be [hex] "34 00 35 00 36 00").
The fact that the byte ordering gets messed up prevents me from using reinterpret_cast as a workaround. More importantly, why is std::wifstream::getline converting my wchar_t buffer into a char buffer at all? I was under the impression that if one wanted to use char they would use ifstream, and if they wanted to use wchar_t they would use wifstream...
I am terrible at making sense of the STL headers, but it almost looks as if wifstream is intentionally converting my wchar_t to char... why?
I would appreciate any insights and explanations for understanding these problems.
wifstream reads bytes from the file and converts them to wide characters using the codecvt facet installed in the stream's locale. The default facet assumes the system-default code page and calls mbstowcs on those bytes.
To treat your file as UTF-16, you need to use codecvt_utf16, like this:
#include <fstream>
#include <locale>
#include <codecvt>

std::wifstream fin("text.txt", std::ios::binary);
// apply a UTF-16 little-endian conversion facet
fin.imbue(std::locale(fin.getloc(),
    new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

Using fwrite in Visual C++ 2005

I'm used to using fwrite to write the contents of a buffer to a file in gcc. However, for some reason, fwrite doesn't seem to work correctly in Visual C++ 2005.
What I'm doing is converting a text file into binary. The program worked fine for the first 61 lines, but at the 62nd line it inserted a 0x0d into the output binary file. Basically, it turned the original
12 0a 00
to
12 0d 0a 00
I checked the buffer, the content was correct, i.e.
buffer[18] = 0x12, buffer[19] = 0x0a, buffer[20] = 0x00
And I tried to write this buffer to file by
fwrite(buffer, 1, length, fout)
where length is the correct value of the size of the buffer content.
This happened to me once, and I had to change my code from fwrite to WriteFile for it to work correctly. Is there a reason a 0x0d is inserted into my output? Can I fix this, or do I have to try using WriteFile instead?
The problem is that the file has been opened in text mode, so the runtime converts each linefeed (0x0A) it writes into a carriage return + linefeed pair (0x0D 0x0A).
When you open your file, specify binary mode by using the b qualifier on the file mode:
FILE *pFile = fopen(szFilename, "wb");
In VC++, if t or b is not given in the file mode, the default translation mode is defined by the global variable _fmode. This may explain the difference between compilers.
You'll need to do the same when reading a binary file too. In text mode, carriage return–linefeed combinations are translated into single linefeeds on input, and linefeed characters are translated to carriage return–linefeed combinations on output.

What's the best way to diagnose character encoding problems in Vim?

This happens often if I open a plain text file in Vim. I see normal character text, but then � characters here and there, usually where there should just be a space. If I type :set encoding I see encoding=utf-8, and this is correct since I see smart quotes in the text where they should be. What are these � characters and how can I fix how they are displayed?
� is the Unicode REPLACEMENT CHARACTER (U+FFFD). Whenever you use any UTF encoding (UTF-8, UTF-16, UTF-32), byte sequences that are illegal in that encoding are typically shown as �. The other options are discarding the offending byte sequences or halting the decoding process entirely at the first sign of trouble.
For example, the bytes for hellö in ISO-8859-1:
68 65 6c 6c f6
When decoded as UTF-8, this becomes hell�. The byte 0xF6 can never appear on its own in UTF-8, but the other bytes are completely valid and, "by accident", even decode to the same characters.
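The same effect can be reproduced with a small Perl sketch (Encode's default behaviour when it hits a malformed sequence is to substitute U+FFFD):
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';

my $bytes = "\x68\x65\x6c\x6c\xf6";    # "hellö" encoded as ISO-8859-1
my $text  = decode('UTF-8', $bytes);   # decode those bytes as if they were UTF-8
print "$text\n";                       # prints "hell�": 0xF6 became U+FFFD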

Decoding legacy binary format

I am trying to figure out how to decode a "legacy" binary file coming from a Windows application (circa 1990). Specifically, I am having trouble understanding which encoding is used for the strings being stored.
Example: the Unicode string "Düsseldorf" is represented as "Du\06sseldorf", or hex "44 75 06 73 73 65 6C 64 6F 72 66", where everything is single-byte except "u + \06", which mysteriously becomes a u-umlaut (ü).
Is it completely proprietary? Any ideas?
Since this app pre-dates DBCS and Unicode, I suspect that it is proprietary. It looks like they might be using the control values below 32 to represent the various accent marks.
\06 may indicate "put an umlaut on the previous character".
Try replacing the string with "Du\05sseldorf" and see if the accent changes over the u. Then try other escaped values between 1 and 31, and I suspect you may be able to come up with a map for these escape characters. Of course, once you have the map, you could easily create a routine to replace all of the strings with proper modern Unicode strings with the accents in place.
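Once you have such a map, that routine is straightforward. Here is a sketch of the idea in Perl; the map below is entirely hypothetical, and the guess that \06 means "umlaut on the preceding character" would have to be confirmed by experiment:
use Encode qw(decode);
use Unicode::Normalize qw(NFC);
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical map: legacy control byte => Unicode combining mark
# applied to the character that precedes it.
my %combining = (
    "\x06" => "\x{0308}",    # guess: combining diaeresis (umlaut)
    # "\x05" => "\x{0301}",  # guess: combining acute accent?
);

my $raw  = "Du\x06sseldorf";                           # bytes from the legacy file
my $text = decode('ISO-8859-1', $raw);                 # assume a single-byte base encoding
$text =~ s{([\x01-\x1F])}{ $combining{$1} // '' }ge;   # control code -> combining mark
$text = NFC($text);                                    # compose "u" + U+0308 into "ü"
print "$text\n";                                       # Düsseldorf, if the guess for \06 is right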
