GNU gettext msgfilter program says "invalid multibyte sequence" - locale

The GNU gettext program msgfilter does not seem to accept a UTF-8 string as the result of the script given as a filter. The script just returns prepared text read from a file.
Here is the test setup:
echo '#!/bin/bash
cat /tmp/t3.txt
' > /tmp/trans01.sh
chmod a+rwx /tmp/trans01.sh
Then there is a file /tmp/t3.txt:
cat /tmp/t3.txt
Result:
AMSTERDAM REISEFÜHRER FÜR REISE, UNTERKUNFT, SEHENSWÜRDIGKEITEN
It is a UTF-8 file:
file /tmp/t3.txt
Gives:
/tmp/t3.txt: UTF-8 Unicode text
Further:
echo 'msgid "kk71ams_amsterdam_main_page_title"
msgstr "AMSTERDAM TOURIST GUIDE FOR TRAVEL, ACCOMMODATION, ATTRACTIONS"
' > /tmp/te1.po
Then:
cat /tmp/te1.po
Gives:
msgid "kk71ams_amsterdam_main_page_title"
msgstr "AMSTERDAM TOURIST GUIDE FOR TRAVEL, ACCOMMODATION, ATTRACTIONS"
Then:
file /tmp/te1.po
Gives:
/tmp/te1.po: GNU gettext message catalogue, ASCII text
Locale:
:~# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
Now the problem with 'msgfilter':
~# msgfilter -i /tmp/te1.po '/tmp/trans01.sh'
msgid "kk71ams_amsterdam_main_page_title"
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
/tmp/te1.po:2: invalid multibyte sequence
msgstr "AMSTERDAM REISEFHRER FR REISE, UNTERKUNFT, SEHENSWRDIGKEITEN\n"

Not exactly the same situation, but I had the same issue, and I solved it by setting the correct Content-Type.
I had:
"Content-Type: text/plain; charset=ASCII\n"
This seems to be the default.
And changed it to:
"Content-Type: text/plain; charset=UTF-8\n"
Even though my file was also UTF-8, I explicitly had to change the charset in the Content-Type header.
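For reference, a minimal sketch of the header entry that carries the corrected charset; applied to the /tmp/te1.po from the question, the file would then start with:
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

msgid "kk71ams_amsterdam_main_page_title"
msgstr "AMSTERDAM TOURIST GUIDE FOR TRAVEL, ACCOMMODATION, ATTRACTIONS"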

I had the same problem and just resolved it.
You may have forgotten --keep-header in the msgfilter command.
This flag keeps the header from the source file in the output file; without it, the output .po file is apparently treated as an ASCII file.
Now I do:
msgfilter --keep-header -i mvap.po -o en_US.po ./script_merge_translate.sh
and it works.

Related

utf8 in erlang format becomes \x (backslash x) ascii encoded

I want to print a utf8 list on my Linux terminal.
-module('main').
-export([main/1]).

main(_) ->
    Text = "あいうえお",
    io:format("~ts~n", [Text]),
    halt().
When I compile and run it on Ubuntu 22.04,
$ erlc main.erl
$ erl -noshell -run main main run
\x{3042}\x{3044}\x{3046}\x{3048}\x{304A}
it shows as \x{3042} instead of あ.
In UTF-8, "あいうえお" should be 15 bytes.
How can I split \x{3042} into 3 bytes and print あ?
("あ" is a Japanese character, by the way.)
list_to_binary didn't work for Unicode.
I found unicode:characters_to_list, which converts a binary to a list for Unicode, but I couldn't find the opposite.
If you want to use Erlang's Unicode output, then remove the -noshell. Adding +pc unicode is also good practice.
$ erl +pc unicode -run main main run
Erlang/OTP 24 [erts-12.2.1] [source] [64-bit] ...
あいうえお
In Erlang you can specify a binary as UTF-8. For example, to see the three-byte binary representation of the Japanese character "あ":
1> <<"あ"/utf8>>.
<<227,129,130>>
In your example, to take the first glyph of your string:
1> Text = "あいうえお".
[12354,12356,12358,12360,12362]
2> unicode:characters_to_binary(Text, unicode, utf8).
<<227,129,130,227,129,132,227,129,134,227,129,136,227,129,138>>
3> binary:part(unicode:characters_to_binary(Text, unicode, utf8),0,3).
<<227,129,130>>
4> io:format("~ts~n",[binary:part(unicode:characters_to_binary(Text, unicode, utf8),0,3)]).
あ
To save Unicode to a file, use Erlang's file encoding options.
5> {ok,G} = file:open("/tmp/unicode.txt",[write,{encoding,utf8}]).
{ok,<0.148.0>}
6> io:put_chars(G,Text).
ok
7> file:close(G).
Then in a shell
$ file /tmp/unicode.txt
/tmp/unicode.txt: Unicode text, UTF-8 text, with no line terminators
$ cat /tmp/unicode.txt
あいうえお

Linux setxattr: possible to use Unicode string?

I wrote the following code in VS Code and ran it to set a file attribute. It seemed to run successfully, but when I checked the value, the text was not correct. Are Unicode strings supported for file extended attributes? If so, how can I fix the code below?
#include <stdio.h>
#include <sys/xattr.h>

int main()
{
    printf("ねこ\n");
    ssize_t res = setxattr("/mnt/cat/test.txt", "user.dog",
                           "ねこ", 2, 0); /* also tested 4 and 8 */
    printf("Result = %lu\n", (unsigned long)res);
    return 0;
}
Programme output
ねこ
Result = 0
Reading the attribute:
$ getfattr test.txt -d
# file: test.txt
user.dog=0s44E=
Obviously ねこ can't be stored in 2 bytes. The characters are U+306D and U+3053, encoded in UTF-8 as E3 81 AD E3 81 93, so the length must be set to 6. If you do that, you'll see that getfattr test.txt -d outputs
user.dog=0s44Gt44GT
That's because -d doesn't know what format the data is in and just dumps it as binary. The 0s prefix means that the data is in base64, as stated in the manpage:
-d, --dump
Dump the values of all matched extended attributes.
-e en, --encoding=en
Encode values after retrieving them. Valid values of en are "text", "hex", and "base64". Values encoded as text strings are enclosed in double quotes ("), while strings encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
Just plug 44Gt44GT into any base64 decoder, or run echo 44Gt44GT | base64 --decode, and you'll see the correct string printed out. To see the string directly from getfattr, you need to specify the format with -e text:
$ getfattr -n user.dog -e text test.txt
# file: test.txt
user.dog="ねこ"
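As a shell-level cross-check (a sketch assuming the setfattr utility from the attr package is available), setfattr computes the byte length of the value for you, so you don't have to count UTF-8 bytes by hand:
setfattr -n user.dog -v "ねこ" test.txt   # value length is taken from the string itself
getfattr -n user.dog -e text test.txt    # should print user.dog="ねこ"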

Printing a string containing utf-8 byte sequences on perl

I'm new to Perl, and I'm trying to print out the folderName from mork files (from Thunderbird).
From: https://github.com/KevinGoodsell/mork-converter/blob/master/doc/mork-format.txt
The second type of special character sequence is a dollar sign
followed by two hexadecimal digits which give the value of the
replacement byte. This is often used for bytes that are non-printable
as ASCII characters, especially in UTF-16 text. For example, a string
with the Unicode snowman character (U+2603):
☃snowman☃
may be represented as UTF-16 text in an Alias this way:
<(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>
From all the Thunderbird files I've seen it's actually encoded in UTF-8 (2 to 4 bytes).
The following characters need to be escaped (with \) within the string to be used literally: $, ) and \
Example: aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08 should print aaa$AAñb☺í\x08
$C3$B1 is ñ; $E2$98$BA is ☺; $C3$AD is í
I tried using a regex to replace unescaped $ with \x:
my $unescaped = qr/(?<!\\)(?:(\\\\)*)/;
$folder =~ s/$unescaped\$/\\x/g;
$folder =~ s/\\([\\$)])/$1/g; # unescape "\ $ ("
Within Perl it just prints the literal string.
My workaround is feeding it into bash's printf, and it succeeds... unless there's a literal "\x" in the string:
$ folder=$(printf "$(mork.pl 8777646a.msf)")
$ echo "$folder"
aaa$AAñb☺í
Questions I consulted:
Convert UTF-8 character sequence to real UTF-8 bytes
But it seems it interprets every byte by itself, not in groups.
In Perl, how can I convert an array of bytes to a Unicode string?
I don't know how to apply this solution to my use case.
Is there any way to achieve this in Perl?
The following substitution seems to work for your input:
s/\\([\$\\])|\$(..)/$2 ? chr hex $2 : $1/ge;
Capture \$ or \\; if matched, replace them with $ or \. Otherwise, capture $.. and convert it to the corresponding byte.
If you want to work with the result in Perl, don't forget to decode it from UTF-8 (decode is provided by the Encode module):
$chars = decode('UTF-8', $bytes);
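For example, a sketch of applying the substitution from the shell (assuming mork.pl writes the escaped string to stdout, as in the question), which avoids the bash printf workaround entirely:
folder=$(mork.pl 8777646a.msf | perl -pe 's/\\([\$\\])|\$(..)/$2 ? chr hex $2 : $1/ge')
echo "$folder"
Since the replacement produces raw UTF-8 bytes, a UTF-8 terminal will display them correctly.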

Why is $'\0' or $'\x0' an empty string? It should be the null character, shouldn't it?

bash allows $'string' expansion. My man bash says:
Words of the form $'string' are treated specially.
The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard.
Backslash escape sequences, if present, are decoded as follows:
\a alert (bell)
\b backspace
\e
\E an escape character
\f form feed
\n new line
\r carriage return
\t horizontal tab
\v vertical tab
\\ backslash
\' single quote
\" double quote
\nnn the eight-bit character whose value is the octal value nnn (one to three digits)
\xHH the eight-bit character whose value is the hexadecimal value HH (one or two hex digits)
\cx a control-x character
The expanded result is single-quoted, as if the dollar sign had not been present.
But why does bash not convert $'\0' and $'\x0' into a null character?
Is it documented? Is there a reason? (Is it a feature or a limitation or even a bug?)
$ hexdump -c <<< _$'\0'$'\x1\x2\x3\x4_'
0000000 _ 001 002 003 004 _ \n
0000007
echo gives the expected result:
> hexdump -c < <( echo -e '_\x0\x1\x2\x3_' )
0000000 _ \0 001 002 003 _ \n
0000007
My bash version
$ bash --version | head -n 1
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
Why does echo $'foo\0bar' not behave like echo -e 'foo\0bar'?
It's a limitation. bash does not allow string values to contain interior NUL bytes.
Posix (and C) character strings cannot contain interior NULs. See, for example, the Posix definition of character string (emphasis added):
3.92 Character String
A contiguous sequence of characters terminated by and including the first null byte.
Similarly, standard C is reasonably explicit about the NUL character in character strings:
§5.2.1p2 …A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
Posix explicitly forbids the use of NUL (and /) in filenames (XBD 3.170) and in environment variables (XBD 8.1: "... are considered to end with a null byte.").
In this context, shell command languages, including bash, tend to use the same definition of a character string: a sequence of non-NUL characters terminated by a single NUL.
You can pass NULs freely through bash pipes, of course, and nothing stops you from assigning a shell variable to the output of a program which outputs a NUL byte. However, the consequences are "unspecified" according to Posix (XSH 2.6.3 "If the output contains any null bytes, the behavior is unspecified."). In bash, the NULs are removed, unless you insert a NUL into a string using bash's C-escape syntax ($'\0'), in which case the NUL will end up terminating the value.
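A quick demonstration of that stripping behavior (recent bash versions also print a warning about the ignored null byte; older ones strip it silently):
v=$(printf 'foo\0bar')   # the NUL is removed during command substitution
echo "${#v}"             # prints 6, i.e. the length of "foobar"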
On a practical note, consider the difference between the two following ways of attempting to insert a NUL into the stdin of a utility:
$ # Prefer printf to echo -n
$ printf $'foo\0bar' | wc -c
3
$ printf 'foo\0bar' | wc -c
7
$ # Bash extension which is better for strings which might contain %
$ printf %b 'foo\0bar' | wc -c
7
But why does bash not convert $'\0' and $'\x0' into a null character?
Because a null character terminates a string.
$ echo $'hey\0you'
hey
It is a null character, but it depends on what you mean by that.
The null character represents an empty string, which is what you get when you expand it. It is a special case and I think that is implied by the documentation but not actually stated.
In C, binary zero '\0' terminates a string, and on its own it also represents an empty string. Bash is written in C, so it probably follows from that.
Edit: POSIX mentions a null string in a number of places. In the Base Definitions it defines a null string as:
3.146 Empty String (or Null String)
A string whose first byte is a null byte.

How to convert an XML file that is in a non-UTF-8 format to XML that is UTF-8 compliant

I have a huge XML file whose sample data is as follows:
<vendor name="aglaia">
<vendorOUI oui="000B91" description="Aglaia Gesellschaft für Bildverarbeitung ud Kommunikation m" />
</vendor>
<vendor name="ag">
<vendorOUI oui="0024A9" description="Ag Leader Technology" />
</vendor>
As can be seen, there is text like "Gesellschaft für Bildverarbeitung" that is not UTF-8 compliant, because of which I am getting errors from the XML validator, such as:
Import failed:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
So the question is: how can I convert the XML file to a UTF-8-compliant format in a Linux environment? Or is there a way in bash to ensure, while creating the XML in the first place, that all variables/strings are stored in UTF-8?
Use the character set conversion tool:
iconv -f ISO-8859-1 -t UTF-8 filename.txt
See the GNU iconv page.
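A sketch of the full round trip (file names here are placeholders; xmllint from libxml2 is an optional extra), noting that iconv writes to stdout, so the result has to be redirected:
iconv -f ISO-8859-1 -t UTF-8 data.xml > data.utf8.xml   # convert; iconv prints to stdout
file data.utf8.xml                                      # should now report UTF-8 text
xmllint --noout data.utf8.xml                           # optional well-formedness check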
...and in the file http://standards.ieee.org/develop/regauth/oui/oui.txt, "aglaia" (as in your example above) is reported as:
00-0B-91 (hex) Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m
000B91 (base 16) Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m
Tiniusstr. 12-15
Berlin D-13089
GERMANY
It seems like "ü" is the character that gets mangled.
Update
When downloading "oui.txt" using wget, I see the character "ü" in the file. If you don't have that, something is broken in your download. Consider using one of these:
wget --header='Accept-Charset: utf-8'
try using curl -o oui.txt instead
If none of the above works, just open the link in your favorite browser and do a "save as". In that case, comment out the wget line in the script below.
I had success with the following script (update the BEGIN & END blocks to produce a valid XML file):
#!/bin/bash
wget http://standards.ieee.org/develop/regauth/oui/oui.txt
# Convert the listing to UTF-8 before parsing it
iconv -f iso-8859-15 -t utf-8 oui.txt > converted
awk 'BEGIN {
    print "XML-header"
}
/base 16/ {
    # $1 is the OUI, $4 is the first word of the vendor name;
    # the description is everything from $4 to the end of the line
    printf("<vendor name=\"%s\">\n", $4)
    desc = substr($0, index($0, $4))
    printf("<vendorOUI oui=\"%s\" description=\"%s\"/>\n", $1, desc)
}
END {
    print "XML-footer"
}
' converted
Hope this helps!
