How does Vim know the encoding of my files when even I don't?

I have been in charset hell for days, and Vim somehow always shows the right charset for my files, even when I'm not sure what they are (I'm dealing with files with identical content encoded in both charsets, mixed together).
I can tell which encoding I'm in by inspecting the ü (u-umlaut) character, which looks different in UTF-8 and ISO-8859-1, but I don't understand how Vim figured it out; in those character sets only the 'special characters' really look any different.
If there is some other record of the encoding/charset information, I would love to know about it.

The explanation can be found under :help 'fileencodings':
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it. If all fail, 'fileencoding' is set to
an empty string, which means the value of 'encoding' is used.
So there's no magic involved. When there's a Byte Order Mark in the file, that's easy. Otherwise, Vim tries some other common encodings, which you can influence with that option; e.g. Japanese users will probably include something like sjis if they frequently edit files in that encoding.
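For example, someone who often edits Shift-JIS files might put something like this in their vimrc (a sketch only; the exact order is a matter of taste, and latin1 at the end acts as a catch-all, since every byte sequence is valid latin1):
set fileencodings=ucs-bom,utf-8,sjis,latin1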
If you want a more intelligent detection, there are plugins for that, e.g. AutoFenc - Tries to automatically detect and set file encoding.

How does VIM perform charset conversion?

I see the following paragraph in the Vim documentation introducing charset conversion:
Vim will automatically convert from one to another encoding in several places:
- When reading a file and 'fileencoding' is different from 'encoding'
- When writing a file and 'fileencoding' is different from 'encoding'
- When displaying characters and 'termencoding' is different from 'encoding'
- When reading input and 'termencoding' is different from 'encoding'
- When displaying messages and the encoding used for LC_MESSAGES differs from
'encoding' (requires a gettext version that supports this).
- When reading a Vim script where |:scriptencoding| is different from
'encoding'.
- When reading or writing a |viminfo| file.
I want to know what gets converted to what. For example:
"When reading a file and 'fileencoding' is different from 'encoding'"
Is 'fileencoding' converted to 'encoding'? Or is 'encoding' converted to 'fileencoding'?
What is the relationship between the actual charset of the file and fileencoding and encoding?
If the actual charset of the file and the value of fileencoding are not equal, will the above conversion operations destroy the contents of the file?
UPDATE:
For example: the value of encoding is utf-8, Vim opens a file foo and, based on fileencodings, matches a fileencoding value of sjis (assuming I don't know the actual encoding of this file). I edit foo and use ":wq" to save and close the Vim window. If I open the foo file again, is the actual encoding of this file the sjis specified by fileencoding, or the utf-8 specified by encoding when I last edited?
'encoding' is the internal representation of any buffer text inside Vim; this is what Vim is working on. When you're dealing with different character sets (or if you don't care and work on a modern operating system), it's highly recommended to set this to utf-8, as the Unicode encoding ensures that any character can be represented and no information is lost. (And UTF-8 is the only Unicode representation that Vim internally supports; i.e. you cannot make it use a double-byte encoding like UTF-16.)
When you open a file in Vim, the list of possible encodings in 'fileencodings' (note the plural!) is considered:
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it.
So if a file doesn't look right, this is the option to tweak; alternatively, you can explicitly override the detection via the ++enc argument, e.g.
:edit ++enc=sjis japanese.txt
Now, Vim has the file's source encoding (persisted in the (singular!) 'fileencoding' option; this is needed for writing it back in the original encoding), and converts the text (if different) to its internal 'encoding'. All Vim commands operate on that, and on :write, the conversion happens in reverse (unless overridden by :w ++enc=...).
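You can inspect both values for the current buffer at any time:
:set fileencoding?    " what was detected for this file, e.g. fileencoding=sjis
:set encoding?        " the internal representation, e.g. encoding=utf-8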
Conclusions
As long as the detected / passed encoding is right, and assuming the internal 'encoding' is able to represent all read characters (guaranteed with utf-8), there will be no data loss.
Likewise, as the original encoding is stored in 'fileencoding', writes of the file transparently convert back. Now, it could have happened that editing introduced a character that cannot be represented in the file's encoding (but you were able to edit it in because of Vim's internal Unicode encoding). Vim will then print E513: write error, conversion failed on writing, and you have to manually change the character(s), or choose a different target file encoding.
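A small sketch of the two ways out, assuming you decide the file may become UTF-8:
" one-off: write this buffer as UTF-8; 'fileencoding' itself stays unchanged
:write ++enc=utf-8
" or persistently: change the target encoding, after which every :write uses it
:setlocal fileencoding=utf-8
:write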
Example
A file with these Kanji characters 日本 is represented as follows in the SJIS encoding:
93fa 967b 0a
Each Kanji is stored in two bytes, and then you have the one-byte newline (LF) at the end.
With :set encoding=utf-8, this is represented internally as (g8 can tell you this):
e697 a5e6 9cac 0a
In UTF-8, each Kanji is stored in three bytes; the first Kanji is e6 97 a5.
Now if I edit the text, e.g. enclosing with (ASCII) parentheses, and :write, I get this:
2893 fa96 7b29 0a
The original SJIS encoding is restored, each Kanji is two bytes again, now with the added parentheses 28 and 29 around it.
Had I tried to edit in a ä character, the :write would have failed with the E513 error, as that character cannot be represented in SJIS.
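You can reproduce this inspection from inside Vim; g8 is a normal-mode command showing the UTF-8 bytes of the character under the cursor, and xxd (which ships with Vim) dumps the bytes on disk. Using the hypothetical japanese.txt from this example:
" reopen the file, forcing the SJIS interpretation
:edit ++enc=sjis japanese.txt
" with the cursor on 日, g8 prints the internal UTF-8 bytes: e6 97 a5
g8
" dump the file's bytes on disk; they are still SJIS: 93fa 967b 0a
:!xxd japanese.txt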

(VIM) Is vimgrep capable of searching Unicode strings?

Is vimgrep capable of searching Unicode strings?
For example:
a.txt contains the wide string "hello"; vimgrep hello *.txt finds nothing, and of course the path is right.
"Unicode" is a bit misleading in this case. What you have is not at all typical of text "encoded in accordance with any of the method provided by the Unicode standard". It's a bunch of normal characters with normal code points separated with NULL characters with code point 0000 or 00. Some Java programs do output that kind of garbage.
So, if your search pattern is hello, Vim and :vim are perfectly capable of searching for and finding hello (without NULs), but they won't ever find hello (with NULs).
Searching for h^@e^@l^@l^@o (where ^@ is entered with <C-v><C-@>), on the other hand, will find hello (with NULs) but not hello (without NULs).
Anyway, converting that file/buffer, or making sure you don't end up with such garbage in the first place, are much better long-term solutions.
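A minimal sketch of that conversion, assuming the file really is UTF-16 little-endian without a BOM (as diagnosed in the next answer):
" reread the file with the correct encoding, then persist it as UTF-8
:edit ++enc=utf-16le a.txt
:setlocal fileencoding=utf-8
:write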
If Vim can detect the encoding of the file, then yes, Vim can grep the file. :vimgrep works by first reading in the file as normal (even including autocmds) into a hidden buffer, and then searching the buffer.
It looks like your file is little-endian UTF-16, without a byte-order mark (BOM). Vim can detect this, but won't by default.
First, make sure your Vim is running with internal support for Unicode. To do that, put set encoding=utf-8 at the top of your .vimrc. Next, Vim needs to be able to detect this file's encoding. The 'fileencodings' option controls this.
By default, when you set 'encoding' to utf-8, Vim's 'fileencodings' option contains "ucs-bom", which will detect UTF-16, but ONLY if a BOM is present. To also detect it when no BOM is present, you need to add your desired encoding to 'fileencodings'. It needs to come before any of the 8-bit encodings but after ucs-bom. Try putting this at the top of your .vimrc and restarting Vim:
set encoding=utf-8
set fileencodings=ucs-bom,utf-16le,utf-8,default,latin1
Now loading files with the desired encoding should work just fine for editing, and therefore also for vimgrep.
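After that, the usual workflow applies, for instance:
" search all .txt files, then browse the matches in the quickfix window
:vimgrep /hello/ *.txt
:copen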

How can I find the character code of a special character in my text editor?

When pasting text from outside sources into a plain-text editor (e.g. TextMate or Sublime Text 2) a common problem is that special characters are often pasted in as well. Some of these characters render fine, but depending on the source, some might not display correctly (usually showing up as a question mark with a box around it).
So this is actually 2 questions:
Given a special character (e.g., ’ or ♥) can I determine the UTF-8 character code used to display that character from inside my text editor, and/or convert those characters to their character codes?
For those "extra-special" characters that come in as garbage, is there any way to figure out what encoding was used to display that character in the source text, and can those characters somehow be converted to UTF-8?
My favorite site for looking up characters is fileformat.info. They have a great Unicode character search that includes a lot of useful information about each character and its various encodings.
If you see the question mark with a box, that means you pasted something that can't be interpreted, often because it's not legal UTF-8 (not every byte sequence is legal UTF-8). One possibility is that it's UTF-16 with an endian mode that your editor isn't expecting. If you can get the full original source into a file, the file command is often the best tool for determining the encoding.
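For instance, with the pasted bytes saved to a hypothetical file mystery.txt, from inside Vim (or drop the :! to run it in a shell):
:!file -i mystery.txt
" typical output: mystery.txt: text/plain; charset=utf-16le
" (-i is the GNU flag; on macOS use file -I)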
At &what I built a tool focused on searching for characters. It indexes all the Unicode and HTML entity tables, but also supplements them with hacker dictionaries and a database of keywords I've collected, so you can search for words like heart, quot, weather, umlaut, hash, or cloverleaf and get what you want. By focusing on search, it saves you from hunting around the Unicode code charts, which can be frustrating. Give it a try.
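And if the editor in question happens to be Vim, the first question needs no lookup site at all: put the cursor on the character and use the normal-mode commands ga and g8. For the ’ character, the output looks something like this:
" ga shows the code point: <’> 8217, Hex 2019, Octal 20031
ga
" g8 shows the UTF-8 byte sequence: e2 80 99
g8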

Vim UTF-8 encoding error on Windows

I have a text file with Polish characters. Unless I :set encoding=utf-8, the characters are not displayed correctly. As soon as I set it to Unicode, the characters are displayed, but the umlauts in Vim's own error messages, on the other hand, are no longer displayed.
Example:
E37: Kein Schreibvorgang seit der letzten <c4>nderung (erzwinge mit !)
Instead of the <c4>, the character Ä should be displayed. Can anybody explain to me why this happens?
I'm experiencing similar issues (you can view some of the questions in my account info, or search for "central european characters" or "croatian characters").
Changing the encoding value changes the way Vim interprets and displays bytes, so the way some of the characters are rendered changes; that's why you're getting those <c4> sequences (the German messages arrive encoded in Latin-1, where the byte c4 is Ä, but that byte on its own is not valid UTF-8, so Vim displays it as <c4>). You could probably solve your problem with Polish characters by choosing some other encoding value (one of the cpXXXX values, for example, instead of utf8), but then you would lose the ability to display utf8 characters, which can make Vim rather pretty. At least this works in my case (Croatian).
So, either use one of the cpXXXX encoding values while writing Polish texts, or stick with utf8 completely. I recommend the first. But don't keep switching between them.
Still working on that here.
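One further workaround, sketched here rather than taken from the answers above (locale names vary per system): keep encoding=utf-8 but switch Vim's own messages to English, so the Latin-1 German text is never mis-rendered:
" on Windows; on Unix a full locale name such as en_US.UTF-8 may be needed
:language messages en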

Problem printing central european characters in Vim

Here's the problem in a nutshell.
I wrote a text file which I need to print out (in a hurry) that contains central european characters (šđčćž/ŠĐČĆŽ).
Vim's encoding settings are as follows:
set encoding=cp1250
set fileencoding=
Upon printing, out comes garbage. What should be changed to fix that?
I really hate Vim's freakin' 1001 options at a time like this. Can't it just do the simple thing and print what's on screen?!
Check the option printencoding.
The help says it's empty by default, and that when 'encoding' is multi-byte, Vim tries to convert characters to 'printencoding'. Moreover, if it's empty, "the conversion will be to latin1". That is probably what's causing the trouble.
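So a minimal thing to try, assuming your printer font actually covers that code page, is to make the print encoding match the buffer:
set printencoding=cp1250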
I'd like to ask: why not use UTF-8?
