How does Vim perform charset conversion?

I see the following paragraph in the Vim documentation's introduction to charset conversion:
Vim will automatically convert from one to another encoding in several places:
- When reading a file and 'fileencoding' is different from 'encoding'
- When writing a file and 'fileencoding' is different from 'encoding'
- When displaying characters and 'termencoding' is different from 'encoding'
- When reading input and 'termencoding' is different from 'encoding'
- When displaying messages and the encoding used for LC_MESSAGES differs from
'encoding' (requires a gettext version that supports this).
- When reading a Vim script where |:scriptencoding| is different from
'encoding'.
- When reading or writing a |viminfo| file.
I want to know which is converted to which. For example:
"When reading a file and 'fileencoding' is different from 'encoding'"
Is 'fileencoding' converted to 'encoding'? Or is 'encoding' converted to 'fileencoding'?
What is the relationship between the file's actual charset, 'fileencoding', and 'encoding'?
If the file's actual charset and the value of 'fileencoding' do not match, will the above conversions corrupt the contents of the file?
UPDATE:
For example: the value of 'encoding' is utf-8, Vim opens a file foo, and based on 'fileencodings' it settles on a 'fileencoding' of sjis (assuming I don't know the actual encoding of this file). I edit foo and use ":wq" to save and close the Vim window. If I open foo again, is the actual encoding of this file the sjis specified by 'fileencoding', or the utf-8 specified by 'encoding' when I last edited it?

'encoding' is the internal representation of any buffer text inside Vim; this is what Vim is working on. When you're dealing with different character sets (or if you don't care and work on a modern operating system), it's highly recommended to set this to utf-8, as the Unicode encoding ensures that any character can be represented and no information is lost. (And UTF-8 is the only Unicode representation that Vim internally supports; i.e. you cannot make it use a double-byte encoding like UTF-16.)
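A minimal sketch of the corresponding line, placed near the top of your vimrc:
set encoding=utf-8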
When you open a file in Vim, the list of possible encodings in 'fileencodings' (note the plural!) is considered:
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it.
So if a file doesn't look right, this is the option to tweak; alternatively, you can explicitly override the detection via the ++enc argument, e.g.
:edit ++enc=sjis japanese.txt
Now, Vim has the file's source encoding (persisted in the (singular!) 'fileencoding' option; this is needed for writing it back in the original encoding), and converts the text (if the encodings differ) to its internal 'encoding'. All Vim commands operate on that, and on :write, the conversion happens in reverse (or can be overridden with :w ++enc=...).
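For instance, to check which encoding Vim detected for the current buffer, or to write a converted copy instead (the copy's name here is just an example):
:set fileencoding?
:write ++enc=utf-8 copy-utf8.txt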
Conclusions
As long as the detected / passed encoding is right, and assuming the internal 'encoding' is able to represent all read characters (guaranteed with utf-8), there will be no data loss.
Likewise, as the original encoding is stored in 'fileencoding', writes of the file transparently convert back. Now, it could have happened that editing introduced a character that cannot be represented in the file's encoding (but you were able to edit it in because of Vim's internal Unicode encoding). Vim will then print E513: write error, conversion failed on writing, and you have to manually change the character(s), or choose a different target file encoding.
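If that happens, one way out (a sketch, not the only option) is to change the buffer's target encoding so the new character can be represented, then write again:
:setlocal fileencoding=utf-8
:write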
Example
A file with these Kanji characters 日本 is represented as follows in the SJIS encoding:
93fa 967b 0a
Each Kanji is stored in two bytes, and then you have the one-byte newline (LF) at the end.
With :set encoding=utf-8, this is represented internally as (g8 can tell you this):
e697 a5e6 9cac 0a
In UTF-8, each Kanji is stored in three bytes, the first Kanji is e6 97 a5.
Now if I edit the text, e.g. enclosing with (ASCII) parentheses, and :write, I get this:
2893 fa96 7b29 0a
The original SJIS encoding is restored, each Kanji is two bytes again, now with the added parentheses 28 and 29 around it.
Had I tried to edit in a ä character, the :write would have failed with the E513 error, as that character cannot be represented in SJIS.
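If you want to reproduce this inspection yourself (the file name here is hypothetical), something along these lines works:
:!xxd japanese.txt
:edit ++enc=sjis japanese.txt
g8
The first command hex-dumps the file on disk (the SJIS bytes); g8, with the cursor on a Kanji, prints the UTF-8 bytes Vim uses internally.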

Related

How to treat multibyte characters simply as a sequence of bytes?

I would like to use vim with binary files. I run vim with -b and I have isprint= and display+=uhex set. I am using the following statusline:
%<%f\ %h%m%r%=%o\ (0x%06O)\ \ %3.b\ <%02B>\ %7P
so I get output containing some useful information like the byte offset in the file and the current character in hex, etc. But I'm having trouble with random pieces of data being interpreted as multibyte characters, which prevents me from accessing the inner bytes; they combine with their surroundings (including Vim's decorations) or display as �.
Of course I have tried opening the files with ++enc=latin1. However, my system's encoding is UTF-8, so what vim supposedly does is convert the file from Latin-1 to UTF-8 internally and display that. This has two problems:
The sequence <c3><ac> displays as the two characters Ã¬ rather than as raw bytes, and each of them counts as two bytes, so it breaks my %o and reports offsets wrong: this is 2 bytes in the file but apparently 4 bytes in vim's buffer.
I don't know why my isprint is ignored. Neither of these characters is between 32 and 126, so they should display in hex.
I found the following workaround: I set encoding to latin1, but termencoding to utf-8. This achieves what I want, but it breaks other things, like when Vim needs to display status messages ("new file", "changed", etc.) in my language, because it wants to use that encoding for them too and they don't fit. I guess I could run vim with LC_ALL=C, but it feels like I'm resorting to too many dirty tricks already. Is there a better way, i.e. without having to mess around with encoding?
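For clarity, the workaround mentioned above is roughly:
set encoding=latin1
set termencoding=utf-8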

Changing default encoding of vim to utf-8 not working

I've been trying to change the default encoding of vim by writing the following lines in my $HOME/.vimrc file:
set fileencodings=utf-8
set fileencoding=utf-8
I understand that the first line makes vim try to read a file as utf-8, and the second line makes vim always save files with utf-8 encoding. But when I write a file called example.txt containing no special characters and save it, I verify with
file -i example.txt
output:
example.txt: text/plain; charset=us-ascii
If I then write special characters to example.txt, it displays correctly:
example.txt: text/plain; charset=utf-8
But I always want the encoding to be utf-8, even if the file does not contain special characters. Why is it not working?
file looks at the content of the file to determine its encoding. If it only finds ASCII characters it can only conclude that the file is ASCII.
ASCII being a subset of UTF-8 (and the basis of a number of other encodings), it is simply impossible for any program to tell whether 123abc is anything other than ASCII.
Of course, if you add UTF-8 characters to that file, file will spot them and act accordingly.
So… on to the Vim side of the "problem".
fileencodings is a list of encodings considered by Vim when reading a file.
fileencoding is the encoding used by Vim when writing a specific file.
The default value of both options depends on the value of encoding, which is set during startup.
The "ideal" encoding is utf-8. With this, you get a sensible list for fileencodings: ucs-bom,utf-8,default,latin1, and a sensible value for fileencoding: utf-8 that pretty much guarantee a smooth experience as long as you stay within the confines of UTF-8.
See:
:help 'encoding'
:help 'fileencodings'
:help 'fileencoding'

(VIM) Is vimgrep capable of searching unicode string

Is vimgrep capable of searching unicode strings?
For example:
a.txt contains the wide string "hello"; vimgrep hello *.txt finds nothing, and of course it's in the right path.
"Unicode" is a bit misleading in this case. What you have is not at all typical of text "encoded in accordance with any of the method provided by the Unicode standard". It's a bunch of normal characters with normal code points separated with NULL characters with code point 0000 or 00. Some Java programs do output that kind of garbage.
So, if your search pattern is hello, Vim and :vim are perfectly capable of searching for and finding hello (without NULLs) but they won't ever find hello (with NULLs).
Searching for h^@e^@l^@l^@o (^@ is <C-v><C-@>), on the other hand, will find hello (with NULLs) but not hello (without NULLs).
Anyway, converting that file/buffer or making sure you don't end up with such garbage are much better long-term solutions.
If Vim can detect the encoding of the file, then yes, Vim can grep the file. :vimgrep works by first reading in the file as normal (even including autocmds) into a hidden buffer, and then searching the buffer.
It looks like your file is little-endian UTF-16, without a byte-order mark (BOM). Vim can detect this, but won't by default.
First, make sure your Vim is running with internal support for unicode. To do that, :set encoding=utf-8 at the top of your .vimrc. Next, Vim needs to be able to detect this file's encoding. The 'fileencodings' option controls this.
By default, when you set 'encoding' to utf-8, Vim's 'fileencodings' option contains "ucs-bom", which will detect UTF-16, but ONLY if a BOM is present. To also detect it when no BOM is present, you need to add your desired encoding to 'fileencodings'. It needs to come before any of the 8-bit encodings but after ucs-bom. Try putting this at the top of your .vimrc and restart Vim:
set encoding=utf-8
set fileencodings=ucs-bom,utf-16le,utf-8,default,latin1
Now loading files with the desired encoding should work just fine for editing, and therefore also for vimgrep.
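With that in place, the search from the question should find its matches, e.g.:
:vimgrep /hello/ *.txt
:copen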

How does vim know encoding of my files? When even I don't

I have been in charset hell for days, and Vim somehow always shows the right charset for my files even when I'm not sure what they are (I'm dealing with files with identical content encoded in both charsets, mixed together).
I can see from inspecting the ü (u-umlaut) character in UTF-8 vs ISO-8859-1 which encoding I'm in, but I don't understand how vim figured it out - in those character-sets only the 'special characters' really look any different
If the encoding/charset information is recorded somewhere else, I would love to know about it.
The explanation can be found under :help 'fileencodings':
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it. If all fail, 'fileencoding' is set to
an empty string, which means the value of 'encoding' is used.
So, there's no magic involved. When there's a Byte Order Mark in the file, that's easy. Else, Vim tries some other common encodings (which you can influence with that option; e.g. Japanese people will probably include something like sjis if they frequently edit such encoded files).
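You can check what the detection settled on for the current file with:
:set fileencoding? bomb?
(the second value tells you whether a Byte Order Mark was found).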
If you want a more intelligent detection, there are plugins for that, e.g. AutoFenc - Tries to automatically detect and set file encoding.

How to substitute cp1250 specific characters to utf-8 in Vim

I have some central european characters in cp1250 encoding in Vim. When I change encoding with set encoding=utf-8, they appear like <d0> and such. How can I substitute those characters over the entire file for what they should be, i.e. Đ in this case?
As sidyll said, you should really use iconv for the purpose. Iconv knows stuff. It knows all the hairy encodings, obscure code points, katakana, denormalized and canonical forms, compositions, nonspacing characters and the rest.
:%!iconv --from-code cp1250 --to-code utf-8
or shorter
:%!iconv -f cp1250 -t utf-8
to filter the whole buffer. If you do
:he xxd
you'll get a sample of how to automatically convert a buffer on load/save, if you want that.
iconv -l will list all the encodings it accepts/knows about (many: 1168 on my system).
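For example, to check beforehand that your iconv knows the source encoding:
:!iconv -l | grep -i 1250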
Happy hacking!
The iconv() function may be useful:
iconv({expr}, {from}, {to}) *iconv()*
The result is a String, which is the text {expr} converted
from encoding {from} to encoding {to}.
When the conversion fails an empty string is returned.
The encoding names are whatever the iconv() library function
can accept, see ":!man 3 iconv".
Most conversions require Vim to be compiled with the |+iconv|
feature. Otherwise only UTF-8 to latin1 conversion and back
can be done.
This can be used to display messages with special characters,
no matter what 'encoding' is set to. Write the message in
UTF-8 and use:
echo iconv(utf8_str, "utf-8", &enc)
Note that Vim uses UTF-8 for all Unicode encodings, conversion
from/to UCS-2 is automatically changed to use UTF-8. You
cannot use UCS-2 in a string anyway, because of the NUL bytes.
{only available when compiled with the +multi_byte feature}
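To make that concrete, a tiny sketch matching the SJIS byte counts from the first answer (this assumes 'encoding' is utf-8 and a Vim with iconv support):
:let sjis = iconv("日本", "utf-8", "sjis")
:echo len(sjis)
This echoes 4 (two two-byte SJIS Kanji); an empty string would have meant the conversion failed.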
You can set 'encoding' to the value of your file's encoding and 'termencoding' to UTF-8. See the Vim mbyte documentation (:help mbyte.txt).
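For the cp1250 file from the question, that would be something like:
:set encoding=cp1250
:set termencoding=utf-8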
