nroff/groff does not properly convert utf-8 encoded file - linux

I have a UTF-8 encoded roff file that I want to convert to a man page with
$ nroff -mandoc inittab.5
However, characters such as [äöüÄÖÜ] are not displayed properly; it seems that nroff assumes ISO 8859-1 encoding (I am getting [äöüÃÃÃ] instead). Calling nroff with the -Tutf8 flag does not change the behaviour, and the locale environment variables are (I assume correctly) set to
LANG=de_DE.utf8
LC_CTYPE="de_DE.utf8"
LC_NUMERIC="de_DE.utf8"
LC_TIME="de_DE.utf8"
LC_COLLATE="de_DE.utf8"
LC_MONETARY="de_DE.utf8"
LC_MESSAGES="de_DE.utf8"
LC_PAPER="de_DE.utf8"
LC_NAME="de_DE.utf8"
LC_ADDRESS="de_DE.utf8"
LC_TELEPHONE="de_DE.utf8"
LC_MEASUREMENT="de_DE.utf8"
LC_IDENTIFICATION="de_DE.utf8"
LC_ALL=
Since nroff is only a wrapper script that eventually calls groff, I checked the underlying call, which is:
$ groff -Tutf8 -mandoc inittab.5
Comparing the byte encodings of characters in the source file and the output file, I get the following conversions:
character   src file   output file
---------   --------   -----------
ä           C3 A4      C3 83 C2 A4
ö           C3 B6      C3 83 C2 B6
ü           C3 BC      C3 83 C2 BC
Ä           C3 84      C3 83
Ö           C3 96      C3 83
Ü           C3 9C      C3 83
ß           C3 9F      C3 83
This behaviour seems very weird to me: why am I getting an additional C3 83, and why is the original byte sequence dropped altogether for the uppercase umlauts and ß?
Why is this and how can I make nroff/groff properly convert my utf-8 encoded file?
EDIT: I am using GNU nroff (groff) version 1.22.2

Unlike some other troff implementations (namely Plan 9 troff and Heirloom troff), groff does not accept UTF-8 encoded input directly. However, correct UTF-8 output can be achieved by running the input through the preconv(1) preprocessor, which converts UTF-8 characters in a file to groff's native escape sequences.
Take for example this groff_ms(7) document:
.TL
StackOverflow Test Document
.AU
ToasterKing
.PP
I like going to the café down the street
äöüÄÖÜ
Using groff normally, we get:
StackOverflow Test Document
ToasterKing
I like going to the café down the street
äöüÃÃÃ
But when using preconv | groff or groff -k, we get:
StackOverflow Test Document
ToasterKing
I like going to the café down the street
äöüÄÖÜ
Looking at the output of preconv, you can see how it transforms characters into escape sequences:
.lf 1 so.ms
.TL
StackOverflow Test Document
.AU
ToasterKing
.PP
I like going to the caf\[u00E9] down the street
\[u00E4]\[u00F6]\[u00FC]\[u00C4]\[u00D6]\[u00DC]
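For the man page from the original question the same approach applies. A minimal sketch of the invocation (using the inittab.5 file from the question):
$ preconv inittab.5 | groff -Tutf8 -mandoc
or, equivalently, letting groff run preconv itself via the -k flag:
$ groff -k -Tutf8 -mandoc inittab.5
If you view the page through man rather than calling nroff directly, man typically runs preconv for you as part of its formatting pipeline.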

Related

Put space every two characters in text string

Given the following string:
6e000000b0040000044250534bb4f6fd02d6dc5bc0790c2fde3166a14146009c8684a4624
This is a representation of a byte array; every two characters represent one byte.
I would like to put a space between the bytes using Sublime Text, something like:
6e 00 00 00 b0 04 00 00 04 42 50
Can Sublime Text help me with this?
As a bonus, I would like to split the result into lines and add 0x before each byte.
I found a similar question, Split character string multiple times every two characters, but it is not related to Sublime Text.
Go to Find -> Replace... and enable regular expressions.
Find: (.{2})
Replace with: $1SPACE
where SPACE is a literal space character.
To split it onto separate lines and add 0x before each byte, do this:
Find: (.{2})
Replace with: 0x\1\n
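If you would rather do the same transformation from the shell, here is a quick sketch with GNU sed (hexstring.txt is a hypothetical file containing the string):
$ sed -E 's/(.{2})/\1 /g' hexstring.txt
$ sed -E 's/(.{2})/0x\1\n/g' hexstring.txt
The first command inserts a space after every two characters; the second puts each byte on its own line with a 0x prefix (GNU sed interprets \n in the replacement as a newline).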

How do I remove specific byte sequences with sed and vim using hex addresses?

I've got a string that looks like this in vim:
PFLUGERVILLE TX 7x691 227 12515 <83>¨¨ x Research Boulevard
For reference in vim,
ga Print the ascii value of the character under the cursor in decimal, hexadecimal and octal.
g8 Print the hex values of the bytes used in the character under the cursor, assuming it is in UTF-8 encoding. This also shows composing characters. The value of 'maxcombine' doesn't matter.
I can inspect it: if I put the cursor over the <83> and type ga, I get this:
<<83>> 131, Hex 0083, Octal 203
If I type g8, I get
c2 83
I would have thought that
sed -e's/\x00\x83//g' ./file.csv
would work to remove the character, but no joy.
Not
sed -e's/\x00\x83//g' ./file.csv
but,
LC_CTYPE=C sed -e's/\x83//g' ./file.csv
You have to use LC_CTYPE=C and drop the leading \x00: the 0083 that ga reports is the character's code point, not the bytes stored in the file, so there is no NUL byte to match.
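Since g8 reports the character as the two bytes c2 83 (the UTF-8 encoding of U+0083), you may prefer to delete the whole sequence so no stray \xc2 byte is left behind. A sketch with GNU sed, writing to a new file rather than editing in place (file.clean.csv is just a hypothetical output name):
LC_CTYPE=C sed -e 's/\xc2\x83//g' ./file.csv > file.clean.csv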

BASH - Convert textfile containing binary numbers into a binary file

I have a long text file that looks something like this:
00000000
00001110
00010001
00010000
00001110
00000001
00010001
00001110
...and so on...
I'd like to take this data, which is represented in ASCII, and write it to a binary file. That is, I do NOT want to write out the ASCII characters themselves, but rather pack the actual 1s and 0s into the bytes of a binary file.
The purpose of this is so that my EPROM programmer can read the file.
I've heard that od and hexdump are useful in this case, but I never really understood how they work.
If it's of any help, I also have the data in hex form:
00 0E 11 10 0E 01 11 0E
How do I do this using a shell script?
Something like perl -ne 'chomp; print pack "B*", $_' input should get you most of the way there. The chomp strips the trailing newline from each line so that it does not contribute a stray bit to the packed output.
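A quick way to check the result is to dump the file back out as hex with od; for the sample lines above it should match the hex form given in the question (out.bin is just a hypothetical output name):
$ perl -ne 'chomp; print pack "B*", $_' input > out.bin
$ od -An -tx1 out.bin
00 0e 11 10 0e 01 11 0e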

Remove BOM from string with Perl

I have the following problem: I am reading from a UTF-8 text file (and I am telling Perl that I am doing so by ":encoding(utf-8)").
The file looks like this in a hex viewer:
EF BB BF 43 6F 6E 66 65 72 65 6E 63 65
This translates to "Conference" when printed. I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it (not because of the warning, but because it messes up a string comparison that I undertake later).
So I tried to remove it using the following code, but I fail miserably:
$line =~ s/^\xEF\xBB\xBF//;
Can anyone enlighten me as to how to remove the UTF-8 BOM from a string which I obtained by reading the first line of the UTF-8 file?
Thanks!
EF BB BF is the UTF-8 encoding of the BOM, but you decoded it, so you must look for its decoded form. The BOM is a ZERO WIDTH NO-BREAK SPACE (U+FEFF) used at the start of a file, so any of the following will do:
s/^\x{FEFF}//;
s/^\N{U+FEFF}//;
s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
s/^\N{BOM}//; # Convenient alias
See also: File::BOM.
I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it
You're getting the "wide character" warning because you forgot to add an :encoding layer to your output file handle. The following adds :encoding(UTF-8) to STDIN, STDOUT and STDERR, and makes it the default for open().
use open ':std', ':encoding(UTF-8)';
To remove the BOM, you have to know that after decoding it is not 3 bytes but a single character, U+FEFF:
s/^\x{FEFF}//;
If you open the file using File::BOM, it will remove the BOM for you.
use File::BOM;
open_bom(my $fh, $path, ':utf8')
Ideally, your filehandle should be doing this for you automatically. But if you're not in an ideal situation, this worked for me:
use Encode;
my $value = decode('UTF-8', $originalvalue);
$value =~ s/\N{U+FEFF}//;
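If you just need to clean a file once, a one-liner from the shell is a reasonable sketch (in.txt and out.txt are hypothetical names; -CSD tells perl to treat the standard streams and any files opened via the command line as UTF-8):
$ perl -CSD -pe 's/^\x{FEFF}// if $. == 1' in.txt > out.txt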

Decoding legacy binary format

I am trying to figure out how to decode a "legacy" binary file coming from a Windows application (circa 1990). Specifically, I am having trouble understanding what encoding is used for the strings that are stored in it.
Example: the Unicode string "Düsseldorf" is represented as "Du\06sseldorf", or hex "44 75 06 73 73 65 6C 64 6F 72 66", where everything is a single byte except "u + \06", which mysteriously becomes a u-umlaut.
Is it completely proprietary? Any ideas?
Since this app pre-dates DBCS and Unicode, I suspect that the format is proprietary. It looks like they might be using the non-printing control values below 31 to represent the various accent marks.
\06 may indicate "put an umlaut on the previous character".
Try replacing the string with "Du\05sseldorf" and see if the accent changes over the u. Then try other escaped values between 1 and 31, and I suspect you may be able to come up with a map for these escape characters. Of course, once you have the map, you could easily create a routine to replace all of the strings with proper modern Unicode strings with the accents in place.
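If \x06 really does mean "put an umlaut on the previous vowel", a throwaway replacement routine could be sketched as a perl one-liner like the one below. The mapping is pure guesswork extrapolated from the single Düsseldorf example, and legacy.dat / decoded.txt are hypothetical file names; the replacements are simply the UTF-8 byte sequences for ä ö ü Ä Ö Ü.
$ perl -pe 's/a\x06/\xc3\xa4/g; s/o\x06/\xc3\xb6/g; s/u\x06/\xc3\xbc/g; s/A\x06/\xc3\x84/g; s/O\x06/\xc3\x96/g; s/U\x06/\xc3\x9c/g' legacy.dat > decoded.txt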
