What does LC_CTYPE locale translation actually do?

I've been trying to understand what LC_CTYPE actually does - can anyone tell me if this is true and if so, point me at the documentation that explains this?
It seems that if I have a locale of en_US.utf8 and I try to print an extended ASCII character (code point >= 128), I get the character I expect, except if I do this:
LC_CTYPE=C <my-command>
If I do that, I seem to get the raw 2-byte UTF-8 representation of the character instead.
So does this mean that:
1. There is some locale that everything translates TO, and if so, what is it?
2. LC_CTYPE defines what I am translating FROM, so if I set LC_CTYPE to a UTF-8 locale already, it assumes I don't need any translation?
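A concrete way to reproduce what I mean, assuming a recent GNU coreutils ls (which escapes bytes that are not printable in the current locale):
$ touch café                    # the name is stored as the bytes 63 61 66 c3 a9
$ LC_CTYPE=en_US.utf8 ls        # c3 a9 decodes to é in a UTF-8 locale
café
$ LC_CTYPE=C ls                 # in the C locale those bytes are unprintable, so ls escapes them
'caf'$'\303\251'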
Thanks.

Related

unidentified characters in terminal

I ran into these strange, garbled characters while calculating pi in the terminal on a Beowulf cluster.
How can I convert these characters into something legible?
Interestingly, when I use fewer processes, the output is normal.
Thanks in advance.
Edit:
This was done with MPICH 1, with 1000 processes on a 3-computer cluster.
Because the output has lots of Unicode replacement characters, it looks as if the locale settings on your machine are not set to use UTF-8 encoding.
Of course, it could simply be from attempting to print binary data on the terminal. But locale is a possibility. In either case, the terminal is running with UTF-8 encoding and your output is not valid UTF-8 text.
Resetting the terminal will not be helpful; it is the application (or your use of it) which is the problem.
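As a quick check, here is a minimal sketch in a UTF-8 terminal (a lone 0x80 byte can never begin a valid UTF-8 sequence):
$ printf '\x80\x80\x80\n'      # three bytes that are not valid UTF-8
���                            # the terminal renders U+FFFD for each byte it cannot decode
$ locale | grep LC_CTYPE       # shows which encoding the terminal session expects
LC_CTYPE="en_US.UTF-8"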
Further reading:
Overcoming frustration: Correctly using unicode in python2
Avoid printing unicode replacement character in Java

Lua with Traditional Chinese

I converted some Lua scripts (containing Chinese characters) from Simplified Chinese to Traditional Chinese, so the Chinese characters are now encoded in cp950.
I switched my Win7 machine's locale to zh_TW and restarted. Everything seemed okay: the scripts with Traditional Chinese characters displayed correctly.
But when I compiled these scripts, I got an error: invalid escape string.
For example:
msg="外功系普攻攻擊"
print(msg)
the result is:
外巨t普攻攻擊
Looking at the hex of the string, it is:
\xa5~\xa5\\\xa8t\xb4\xb6\xa7\xf0\xa7\xf0\xc0\xbb
so Lua is mis-handling the backslash byte (0x5C) inside the string.
Now the problem is: can I solve this? How can I get the scripts to compile successfully? My source scripts cannot be converted to UTF-8; if they could, it would be easy.
This is the famous cp950 (Big5) encoding problem known as the "許功蓋" issue: the second byte of some Big5 characters, including 功 here, is 0x5C, the ASCII backslash, so compilers and interpreters treat it as the start of an escape sequence. See http://zh.wikipedia.org/wiki/%E5%A4%A7%E4%BA%94%E7%A2%BC
So converting to UTF-8 is the best way to solve it.
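To see the collision directly, here is a minimal shell sketch (assuming iconv and xxd are available, in a UTF-8 terminal) showing that the second byte of 功 in Big5/cp950 is 0x5C:
$ printf '功' | iconv -f utf-8 -t big5 | xxd
00000000: a55c                                     .\
If converting the sources to UTF-8 really is impossible, one common workaround is to insert a literal backslash after each affected character inside string literals, so the bytes become A5 5C 5C and Lua's \\ escape restores the original 0x5C; but this must be done for every 許功蓋-class character in every string.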

Set encoding and fileencoding to utf-8 in Vim

What is the difference between these two commands?
set encoding=utf-8
set fileencoding=utf-8
Do I need to set both when I want to use utf-8?
Also, do I need to set fileencoding with set or setglobal?
TL;DR
In the first case, set encoding=utf-8 changes the encoding Vim uses to represent text internally (and hence what it sends to the display).
In the second case, set fileencoding=utf-8 changes the encoding of the file that is written to disk.
As stated by @Dennis, you can set them both in your ~/.vimrc if you always want to work in utf-8.
More details
From the Vim wiki about working with Unicode:
"encoding sets how vim shall represent characters internally. Utf-8 is necessary for most flavors of Unicode."
"fileencoding sets the encoding for a particular file (local to buffer); :setglobal sets the default value. An empty value can also be used: it defaults to same as 'encoding'. Or you may want to set one of the ucs encodings, It might make the same disk file bigger or smaller depending on your particular mix of characters. Also, IIUC, utf-8 is always big-endian (high bit first) while ucs can be big-endian or little-endian, so if you use it, you will probably need to set 'bomb" (see below)."
set encoding=utf-8 " The encoding displayed.
set fileencoding=utf-8 " The encoding written to file.
You may as well set both in your ~/.vimrc if you always want to work with utf-8.
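To check what actually lands on disk, a rough sketch from the shell (e.txt is a throwaway file; xxd ships with Vim):
$ printf 'é\n' > e.txt                            # a UTF-8 shell writes the two bytes c3 a9
$ vim -c 'set fileencoding=latin1' -c 'wq' e.txt  # rewrite the file as latin1
$ xxd e.txt                                       # é is now the single latin1 byte e9
00000000: e90a                                     ..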
You can set the variable 'fileencodings' in your .vimrc.
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it. If all fail, 'fileencoding' is set to
an empty string, which means the value of 'encoding' is used.
See :help fileencodings
If you often work with e.g. cp1252, you can add it there:
set fileencodings=ucs-bom,utf-8,cp1252,default,latin9
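To test the detection order, a small sketch (assuming the fileencodings value above):
$ printf 'caf\xe9\n' > legacy.txt    # a lone 0xe9 byte is invalid UTF-8 but is é in cp1252
$ vim legacy.txt
Inside Vim, :set fileencoding? should then report cp1252: the utf-8 attempt fails on the invalid byte, so the next entry in the list is used.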

How to substitute cp1250 specific characters to utf-8 in Vim

I have some Central European characters in cp1250 encoding in Vim. When I change the encoding with set encoding=utf-8, they appear as <d0> and such. How can I substitute those characters, over the entire file, for what they should be, i.e. Đ in this case?
As sidyll said, you should really use iconv for the purpose. Iconv knows stuff. It knows all the hairy encodings, obscure code points, katakana, denormalized and canonical forms, compositions, nonspacing characters and the rest.
:%!iconv --from-code cp1250 --to-code utf-8
or shorter
:%!iconv -f cp1250 -t utf-8
to filter the whole buffer. If you do
:he xxd
you'll get a sample of how to automatically encode on buffer load/save, if you want that.
iconv -l will list all the encodings it accepts/knows about (many: 1168 on my system).
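For example, to confirm the exact name your iconv knows cp1250 by (the list format varies between implementations; this is the glibc one):
$ iconv -l | tr ' ' '\n' | grep -i cp1250
CP1250//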
Happy hacking!
The iconv() function may be useful:
iconv({expr}, {from}, {to}) *iconv()*
The result is a String, which is the text {expr} converted
from encoding {from} to encoding {to}.
When the conversion fails an empty string is returned.
The encoding names are whatever the iconv() library function
can accept, see ":!man 3 iconv".
Most conversions require Vim to be compiled with the |+iconv|
feature. Otherwise only UTF-8 to latin1 conversion and back
can be done.
This can be used to display messages with special characters,
no matter what 'encoding' is set to. Write the message in
UTF-8 and use:
echo iconv(utf8_str, "utf-8", &enc)
Note that Vim uses UTF-8 for all Unicode encodings, conversion
from/to UCS-2 is automatically changed to use UTF-8. You
cannot use UCS-2 in a string anyway, because of the NUL bytes.
{only available when compiled with the +multi_byte feature}
You can set encoding to the value of your file's encoding and termencoding to UTF-8. See the Vim multi-byte documentation (:help multibyte).

Using vim+LaTeX with Scandinavian characters

I want to create a lab write-up with LaTeX in Ubuntu; however, my text includes Scandinavian characters, and at present I have to type them in using \"a, \"o, etc. Is it possible to get the LaTeX compiler to read these special characters when they are typed in as is? Additionally, I would like Vim to "read" Finnish: right now, when I open a .tex document containing Scandinavian characters, they are not displayed at all in Vim. How can I correct this?
For latex, use the inputenc option:
\usepackage[utf8]{inputenc}
Instead of utf8 you can use whatever else fits your input, such as latin1.
Now the trick is to make your terminal use the same character encoding. It seems that it is currently running a character/input encoding that doesn't match your input.
For this, refer to the "Locale" settings of your distribution. You can always check the locale settings in the terminal by issuing locale. These days UTF-8 locales are preferred, as they work with every character imaginable. If your terminal's environment is set up correctly, Vim should happily work with all your special characters without complaint.
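If the .tex file itself is stored in a legacy encoding, you can check and convert it from the shell; a sketch with a hypothetical file name (the exact output of file varies):
$ file -i writeup.tex                  # -i reports the MIME type and charset
writeup.tex: text/plain; charset=iso-8859-1
$ iconv -f iso-8859-1 -t utf-8 writeup.tex > writeup-utf8.tex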
To find out in which encoding Vim thinks the document is, try:
:set enc
To set the encoding to UTF-8, try:
:set enc=utf8
I can't help with vim, but for LaTeX I recommend you check out XeTeX, which is an extension of TeX that is designed to support Unicode input. XeTeX is now part of Texlive, so if you have TeX installed chances are you already have it.
I use the UCS unicode support: http://iamleeg.blogspot.com/2007/10/nice-looking-latex-unicode.html
