UTF-8 encoding in NSIS version info - nsis

I have this simple UTF-8 script
Unicode true
VIProductVersion "0.0.0.1"
VIAddVersionKey "ProductName" "Test"
VIAddVersionKey "FileVersion" "0.0.0.1"
VIAddVersionKey "FileDescription" "Test installer"
VIAddVersionKey "LegalCopyright" "Me © 2022"
Section
SectionEnd
The issue is with the © character. After compilation with NSIS 3.08, the copyright info ends up in the installer resources as Me \xC2\xA9 2022.
It seems that the source is transformed into UTF-16 byte by byte rather than character by character. Is there a way to ensure proper UTF-16 encoding of that character? I can probably always use (c) instead, but I'm wondering if there's some other way (other than using UTF-16 for the whole script).

Unicode true specifies that you want to generate a Unicode installer, but it does not change the interpretation of the .nsi file itself (it will, however, change the default for !included files).
If the MakeNSIS output includes a line that looks like Processing script file: "C:\Users\Anders\test.nsi" (ACP), then the compiler is using the default codepage for ANSI programs when parsing your .nsi (for compatibility with NSIS v2).
There are several ways to fix this:
Manually specify the Unicode character codepoint: "Me ${U+A9} 2022".
Make sure your .nsi file has a BOM so it is parsed as UTF-8 (a minimal sketch of this option follows the list).
Compile as MakeNSIS /INPUTCHARSET UTF8 MyFile.nsi
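For the BOM option, here is a minimal Python sketch that prepends the UTF-8 BOM to an already UTF-8 encoded .nsi file; the filename is hypothetical:
# Prepend the UTF-8 BOM (EF BB BF) so MakeNSIS parses the script as UTF-8
# instead of the ANSI codepage. Assumes the file is already saved as UTF-8.
BOM = b"\xef\xbb\xbf"

with open("installer.nsi", "rb") as f:      # hypothetical filename
    data = f.read()

if not data.startswith(BOM):
    with open("installer.nsi", "wb") as f:
        f.write(BOM + data)
If it worked, the "Processing script file" line should no longer report (ACP).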

Related

Unicode character not visible while doing cat

I have a CSV file generated by a Windows system. The file is then moved to Linux. The Linux environment is NAME="Red Hat Enterprise Linux Server", VERSION="7.3 (Maipo)", ID="rhel".
When I use the vi editor, all characters are visible. For example, one line reads: "Sarah--bitte nicht löschen".
But when I cat the file, I get something like "Sarah--bitte nicht l▒schen".
This file is consumed by a DataStage application, and these Unicode characters come through as "?" in DataStage. Since cat is not showing the character properly, I believe the issue is on the Linux server. Any help is appreciated.
vi reads the file using the encoding given by its fenc setting and displays the content using your locale setting (the $LANG environment variable). If fenc differs from the locale, vi handles the conversion.
But cat does no conversion; it always outputs the exact byte stream.
Your terminal then renders the output of both vi and cat using your local locale setting.
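To make the byte-level picture concrete, here is a small Python sketch; it assumes the Windows-generated CSV is in Latin-1/cp1252, which the question does not confirm:
# The file likely holds Latin-1 bytes (an assumption); a UTF-8 locale cannot
# decode the lone 0xF6 byte used for "ö", which is what cat's raw output runs into.
text = "Sarah--bitte nicht löschen"
latin1_bytes = text.encode("latin-1")
print(latin1_bytes)                                    # b'Sarah--bitte nicht l\xf6schen'

# What a UTF-8 terminal sees when the raw bytes are written to it:
print(latin1_bytes.decode("utf-8", errors="replace"))  # the "ö" becomes a replacement character, like the ▒ in the question

# What vi effectively does via fenc - convert before displaying:
print(latin1_bytes.decode("latin-1"))                  # Sarah--bitte nicht löschen
If the file really is cp1252/Latin-1, converting it once on the Linux side (for example with iconv -f WINDOWS-1252 -t UTF-8) should make both cat and DataStage happy.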

Is there any way to handle ISO-8859-16 / latin10 encoded files in vim?

From the help page section encoding-values:
Supported 'encoding' values are: *encoding-values*
1 latin1 8-bit characters (ISO 8859-1, also used for cp1252)
1 iso-8859-n ISO_8859 variant (n = 2 to 15)
[...]
Somehow, it seems that ISO-8859-16 / latin10 was left out? I fail to read files with that encoding correctly. Am I overlooking anything? If not, can I somehow add support for this character encoding to vim through a plugin or so?
On Windows, my version of Vim is compiled with +iconv/dyn. According to the Vim documentation:
On MS-Windows Vim can be compiled with the +iconv/dyn feature. This
means Vim will search for the "iconv.dll" and "libiconv.dll"
libraries. When neither of them can be found Vim will still work but
some conversions won't be possible.
The most recent version of the DLL from http://sourceforge.net/projects/gettext/files/libiconv-win32/ seems to do the job for me. Without it I could not convert most iso-8859 encodings other than iso-8859-1. With iconv.dll installed I can load the files easily with:
:e ++enc=iso-8859-16 file.txt
If Vim cannot handle it, you can convert the file to (for example) UTF-8 with the iconv tool:
$ iconv --from-code ISO-8859-16 --to-code UTF-8 -o outputfile inputfile
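If neither iconv.dll nor the iconv tool is at hand, Python's standard codecs also include iso8859-16, so the same conversion can be sketched like this (filenames are placeholders):
# Convert an ISO-8859-16 (latin10) file to UTF-8 with Python's built-in codec.
with open("inputfile", "r", encoding="iso8859-16") as src:     # placeholder name
    content = src.read()

with open("outputfile", "w", encoding="utf-8") as dst:         # placeholder name
    dst.write(content)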

encoding problem?

I work with txt files, and I recently found e.g. these characters in a few of them:
http://pastebin.com/raw.php?i=Bdj6J3f4
What could these characters be? A wrong character encoding? I just want to use normal UTF-8 TXT files, but when I use:
iconv -t UTF-8 input.txt > output.txt
it's still the same.
When I open the files in gedit and copy+paste them into other txt files, there are no characters like the ones in the pastebin. So gedit can solve this problem; it encodes the TXT files well. But there are too many txt files.
Why are there http://pastebin.com/raw.php?i=Bdj6J3f4 -like chars in the text files? Can they be converted to "normal chars"? I can't see e.g. the "ÃŒ" char when I open the files with vim, only after I "work with them" (e.g. awk, etc.).
It would help if you posted the actual binary content of your file (perhaps by using the output of od -t x1). The pastebin returns this as HTML:
"ÃŒ"
"Ã"
"Ã©"
The first line corresponds to U+00C3 U+0152. The last line corresponds to U+00C3 U+00A9, which is the character "\u00e9" ("é") encoded in UTF-8 ("\xc3\xa9") with the UTF-8 bytes reinterpreted as Latin-1.
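The last case can be reproduced in a couple of lines of Python, which also shows that the damage is reversible as long as no bytes were lost:
# "é" stored as UTF-8 (0xC3 0xA9) but read back as Latin-1 becomes "Ã©".
s = "é"
mojibake = s.encode("utf-8").decode("latin-1")
print(mojibake)                                     # Ã©

# Undo the double interpretation:
print(mojibake.encode("latin-1").decode("utf-8"))   # é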
From man iconv:
The iconv program converts text from one encoding to another encoding. More precisely, it converts from the encoding given for the -f option to the encoding given for the -t option. Either of these encodings defaults to the encoding of the current locale.
Because you didn't specify the -f option, it assumes the file is encoded in your current locale's encoding (probably UTF-8), which apparently is not true. Your text editors (gedit, vim) do some encoding detection - you can check which encoding they detect (I don't know how - I don't use either of them) and pass that as the -f option to iconv (or save the open file with your desired encoding using one of those text editors).
You can also use some tool for encoding detection like Python chardet module:
$ python3 -c "import chardet as c; print(c.detect(open('file.txt', 'rb').read(4096)))"
{'confidence': 0.7331842298102511, 'encoding': 'ISO-8859-2'}
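Building on that one-liner, here is a small sketch that lets chardet guess the encoding and then rewrites the file as UTF-8 (the filenames are placeholders, and the guess can be wrong, so keep a copy of the original):
import chardet

raw = open("file.txt", "rb").read()           # placeholder filename
guess = chardet.detect(raw)
print(guess)                                  # e.g. {'encoding': 'ISO-8859-2', ...}

if guess["encoding"]:                         # chardet may give up and return None
    text = raw.decode(guess["encoding"])
    with open("file.utf8.txt", "w", encoding="utf-8") as out:
        out.write(text)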
...solved!
How: I just right-clicked on the folders containing the TXT files and pasted them into another folder... :O and presto, there are no more ugly chars.

How to determine encoding table of a text file

I have .txt and .java files and I don't know how to determine the encoding table of the files (Unicode, UTF-8, ISO-8859, …). Does there exist any program to determine the file encoding or to see the encoding?
If you're on Linux, try file -i filename.txt.
$ file -i vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
For reference, here is my environment:
$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic
Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:
$ file -I vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
Also, have a look here.
Open the file with Notepad++ and you will see the name of the encoding table in the bottom-right corner. In the Encoding menu you can change the encoding table and save the file.
You can't reliably detect the encoding from a text file - what you can do is make an educated guess by searching for a non-ASCII character and trying to determine whether it is a combination that makes sense in the languages you are parsing.
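A rough sketch of that kind of educated guess in Python - try a few candidate encodings and keep the first one that decodes without errors (the candidate list is an assumption, and a clean decode does not prove the guess is right):
# Try candidate encodings in order; UTF-8 first, because random 8-bit text
# rarely happens to be valid UTF-8.
CANDIDATES = ["utf-8", "cp1252", "iso8859-2"]

def guess_encoding(path):
    raw = open(path, "rb").read()
    for enc in CANDIDATES:
        try:
            raw.decode(enc)
            return enc                    # decoded cleanly; may still be the wrong language
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding("file.txt"))         # placeholder filename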
See this question and the selected answer. There’s no sure-fire way of doing it. At most, you can rule things out. The UTF encodings you’re unlikely to get false positives on, but the 8-bit encodings are tough, especially if you don’t know the starting language. No tool out there currently handles all the common 8-bit encodings from Macs, Windows, Unix, but the selected answer provides an algorithmic approach that should work adequately for a certain subset of encodings.
In a text file there is no header that stores the encoding. You can try the Linux/Unix command file, which tries to guess the encoding:
file -i unreadablefile.txt
or on some systems
file -I unreadablefile.txt
But that often gives you text/plain; charset=iso-8859-1 although the file is unreadable (cryptic glyphs).
This is what I did to find the correct encoding of an unreadable file and then convert it to UTF-8, after installing iconv. First I tried all encodings, displaying (with grep) a line that contained the word www. (a website address):
for ENCODING in $(iconv -l); do echo -n "$ENCODING "; iconv -f $ENCODING -t utf-8 unreadablefile.txt 2>/dev/null| grep 'www'; done | less
This last command line shows the tested file encoding and then the translated/transcoded line.
There were some lines that showed readable and consistent (one language at a time) results. I tried some of them manually, for example:
ENCODING=WINDOWS-936; iconv -f $ENCODING -t utf-8 unreadablefile.txt -o test_with_${ENCODING}.txt
In my case it was a Chinese Windows encoding, which is now readable (if you know Chinese).
Does there exist any program to determine the file encoding or to see the encoding?
This question is 10 years old as I write this, and the answer is still "No" - at least not reliably. There has not been much improvement, unfortunately. My recent experience suggests the file -I command is very much "hit-or-miss". For example, when checking a text file on macOS 10.15.6:
% file -I somefile.asc
somefile.asc: application/octet-stream; charset=binary
somefile.asc was a text file. All characters in it were encoded in UTF-16 Little Endian. How did I know this? I used BBEdit - a competent text editor. Determining the encoding used in a file is certainly a tough problem, but...?
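For the UTF-16 LE case specifically, a manual check is easy even when file gives up; this Python sketch just looks at the first bytes (the filename comes from the example above):
# UTF-16 LE text usually starts with the BOM FF FE, and ASCII-range text
# encoded as UTF-16 LE has a NUL high byte in every second position.
raw = open("somefile.asc", "rb").read(64)

print(raw[:2] == b"\xff\xfe")             # True if a UTF-16 LE BOM is present
print(raw[1::2].count(0))                 # many zeros here also points to UTF-16 LE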
If you are using Python, the chardet package is a good option. For example:
from chardet.universaldetector import UniversalDetector

files = ['a-1.txt', 'a-2.txt']
detector = UniversalDetector()
for filename in files:
    print(filename.ljust(20), end='')
    detector.reset()
    for line in open(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print(detector.result)
gives me as a result:
a-1.txt {'encoding': 'Windows-1252', 'confidence': 0.7255358182877111, 'language': ''}
a-2.txt {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Command or option for the xgettext, msginit, msgfmt sequence for setting the MIME type?

An msgfmt "invalid multibyte sequence" error on a Polish text is corrected by manually editing the MIME Content-Type charset in the template file. Is there some command or option in the xgettext, msginit, msgfmt sequence for setting the MIME type?
cat >plt.cxx <<EOF
// plt.cxx
#include <libintl.h>
#include <locale.h>
#include <iostream>
int main (){
    setlocale(LC_ALL, "");
    bindtextdomain("plt", ".");
    textdomain("plt");
    std::cout << gettext("Invalid input. Enter a string at least 20 characters long.") << std::endl;
}
EOF
g++ -o plt plt.cxx
xgettext --package-name plt --package-version 1.2 --default-domain plt --output plt.pot plt.cxx
sed --in-place plt.pot --expression='s/CHARSET/UTF-8/'
msginit --no-translator --locale pl_PL --output-file plt_polish.po --input plt.pot
sed --in-place plt_polish.po --expression='/#: /,$ s/""/"Nieprawidłowo wprowadzone dane. Wprowadź ciąg przynajmniej 20 znaków."/'
mkdir --parents ./pl_PL.utf8/LC_MESSAGES
msgfmt --check --verbose --output-file ./pl_PL.utf8/LC_MESSAGES/plt.mo plt_polish.po
LANGUAGE=pl_PL.utf8 ./plt
Just give the full locale name and msginit will set the charset correctly:
msginit --no-translator --input=xx.pot --locale=ru_RU.UTF-8
results in
"Language: ru\n"
"Content-Type: text/plain; charset=UTF-8\n"
There is no argument for setting the output character encoding directly, but in practice this should not be a problem, as your PO editor will automatically use an appropriate character encoding when saving the PO file (one that supports all the characters used in the translation) and replace CHARSET in the file with the name of that encoding. If it doesn't, file a bug.
The only problem would be if the POT file contained non-ASCII characters, but xgettext does have a --from-code argument for this, which specifies the encoding of the input files. If the input contains non-ASCII characters and --from-code is set to the correct encoding, the output POT file will have the character encoding set to UTF-8 (this need not be equal to the input character encoding). However, if the input files only contain ASCII characters, --from-code=UTF-8 will unfortunately have no effect.
msginit does in fact automatically set the character encoding to something ‘appropriate’ for the chosen target locale. However, the list of locale to character encoding pairs seems outdated; UTF-8 is now really the best choice for all languages.
An alternative would be to use pot2po instead of msginit. This always uses UTF-8 automatically, AFAICS. However, unlike msginit, it does not automatically fill out the plural forms of the PO file, which may or may not be a problem (some think it is the job of the PO editor to do this).
