Changing default encoding of vim to utf-8 not working - linux

I've been trying to change the default encoding of vim by adding the following lines to my $HOME/.vimrc file:
set fileencodings=utf-8
set fileencoding=utf-8
I understand that the first line makes vim try to read a file as UTF-8, and the second makes vim always save files with UTF-8 encoding. But when I create a file called example.txt with no special characters in it, I save it and verify:
file -i example.txt
output:
example.txt: text/plain; charset=us-ascii
If I then write special characters into example.txt, it is reported correctly:
example.txt: text/plain; charset=utf-8
But I want the encoding to always be utf-8, even if the file does not contain special characters. Why is it not working?

file looks at the content of the file to determine its encoding. If it only finds ASCII characters it can only conclude that the file is ASCII.
ASCII being a subset of UTF-8 (and the basis of a number of other encodings), it is simply impossible for any program to tell whether 123abc is anything other than ASCII.
Of course, if you add UTF-8 characters to that file, file will spot them and act accordingly.
So… on to the Vim side of the "problem".
fileencodings is a list of encodings considered by Vim when reading a file.
fileencoding is the encoding used by Vim when writing a specific file.
The default value of both options depends on the value of encoding, which is set during startup.
The "ideal" encoding is utf-8. With it, you get a sensible default for fileencodings (ucs-bom,utf-8,default,latin1) and a sensible value for fileencoding (utf-8) that pretty much guarantee a smooth experience as long as you stay within the confines of UTF-8.
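In practice, that means a single line in your vimrc is usually enough (a minimal sketch; the two fileencoding options then fall back to the sensible defaults quoted above):
set encoding=utf-8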
See:
:help 'encoding'
:help 'fileencodings'
:help 'fileencoding'
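And if you really need file -i to report utf-8 even for pure-ASCII content, the only way is to make the bytes distinguishable. A possible workaround (a sketch; beware that many Unix tools dislike a UTF-8 BOM) is to have Vim write a byte order mark:
:setlocal bomb
:write
The leading ef bb bf bytes then give file something to detect.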

Related

How does VIM perform charset conversion?

I see the following paragraph in the Vim documentation, introducing charset conversion:
Vim will automatically convert from one to another encoding in several places:
- When reading a file and 'fileencoding' is different from 'encoding'
- When writing a file and 'fileencoding' is different from 'encoding'
- When displaying characters and 'termencoding' is different from 'encoding'
- When reading input and 'termencoding' is different from 'encoding'
- When displaying messages and the encoding used for LC_MESSAGES differs from
'encoding' (requires a gettext version that supports this).
- When reading a Vim script where |:scriptencoding| is different from
'encoding'.
- When reading or writing a |viminfo| file.
I want to know which encoding is converted to which. For example:
"When reading a file and 'fileencoding' is different from 'encoding'"
Is 'fileencoding' converted to 'encoding'? Or is 'encoding' converted to 'fileencoding'?
What is the relationship between the actual charset of the file and fileencoding and encoding?
If the actual charset of the file and the value of fileencoding are not equal, will the above conversion operations destroy the contents of the file?
UPDATE:
For example: the value of encoding is utf-8. Vim opens a file foo and, based on fileencodings, matches a fileencoding value of sjis (assume I don't know the actual encoding of this file). I edit foo and use ":wq" to save and close the Vim window. If I open the foo file again, is the actual encoding of this file the sjis specified by fileencoding, or the utf-8 specified by encoding when I last edited?
'encoding' is the internal representation of any buffer text inside Vim; this is what Vim is working on. When you're dealing with different character sets (or if you don't care and work on a modern operating system), it's highly recommended to set this to utf-8, as the Unicode encoding ensures that any character can be represented and no information is lost. (And UTF-8 is the only Unicode representation that Vim internally supports; i.e. you cannot make it use a double-byte encoding like UTF-16.)
When you open a file in Vim, the list of possible encodings in 'fileencodings' (note the plural!) is considered:
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it.
So if a file doesn't look right, this is the option to tweak; alternatively, you can explicitly override the detection via the ++enc argument, e.g.
:edit ++enc=sjis japanese.txt
Now, Vim has the file's source encoding (persisted in the (singular!) 'fileencoding'; this is needed for writing it back in the original encoding), and converts the character set (if different) to its internal 'encoding'. All Vim commands operate on that, and on :write, the conversion happens in reverse (or is optionally overridden by :w ++enc=...).
Conclusions
As long as the detected / passed encoding is right, and assuming the internal 'encoding' is able to represent all read characters (guaranteed with utf-8), there will be no data loss.
Likewise, as the original encoding is stored in 'fileencoding', writes of the file transparently convert back. Now, it could have happened that editing introduced a character that cannot be represented in the file's encoding (but that you were able to insert because of Vim's internal Unicode encoding). Vim will then print E513: write error, conversion failed on writing, and you have to manually change the offending character(s), or choose a different target file encoding, as shown below.
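If you do run into E513, one way out (a sketch using standard :write arguments) is to pick a target encoding that can hold the new character:
:write ++enc=utf-8
or change the buffer's target encoding for good:
:setlocal fileencoding=utf-8
:write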
Example
A file with these Kanji characters 日本 is represented as follows in the SJIS encoding:
93fa 967b 0a
Each Kanji is stored in two bytes, and then you have the one-byte newline (LF) at the end.
With :set encoding=utf-8, this is represented internally as (g8 can tell you this):
e697 a5e6 9cac 0a
In UTF-8, each Kanji is stored in three bytes, the first Kanji is e6 97 a5.
Now if I edit the text, e.g. enclosing with (ASCII) parentheses, and :write, I get this:
2893 fa96 7b29 0a
The original SJIS encoding is restored, each Kanji is two bytes again, now with the added parentheses 28 and 29 around it.
Had I tried to edit in a ä character, the :write would have failed with the E513 error, as that character cannot be represented in SJIS.
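You can reproduce all of this yourself (a sketch; the file name japanese.txt is hypothetical):
:edit ++enc=sjis japanese.txt
:write
:!xxd %
With the cursor on one of the Kanji, g8 prints the internal UTF-8 bytes (e6 97 a5 for 日), while the xxd dump shows the SJIS bytes that actually went to disk (93fa 967b 0a).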

How to display UTF-8 characters in Vim correctly

I want/need to edit files with UTF-8 characters in it and I want to use Vim for it.
Before I get accused of asking something that was asked before, I've read the Vim documentation on encoding, fileencoding[s], termencoding and more, googled the subject, and read this question among other texts.
Here is a sentence with a UTF-8 character in it that I use as a test case.
From Japanese 勝 (katsu) meaning "victory"
If I open the (UTF-8) file with Notepad it is displayed correctly.
When I open it with Vim, the best thing I get is a black square where the Japanese character for katsu should be.
Changing any of the settings for fileencoding or encoding does not make a difference.
Why is Vim giving me a black square where Notepad displays it without problems? If I copy the text from Vim and paste it into Notepad, it is displayed correctly, indicating that the text is not corrupted but merely displayed wrongly. But what setting(s) have influence on that?
Here is the relevant part of my _vimrc:
if has("multi_byte")
  set encoding=utf-8
  if &termencoding == ""
    let &termencoding = &encoding
  endif
  setglobal fileencoding=utf-8
  set fileencodings=ucs-bom,utf-8,latin1
endif
The actual settings when I open the file are:
encoding=utf-8
fileencoding=utf-8
termencoding=utf-8
My PC is running Windows 10, language is English (United States).
This is what the content of the file looks like after loading it in Vim and converting it to hex:
0000000: efbb bf46 726f 6d20 4a61 7061 6e65 7365 ...From Japanese
0000010: 20e5 8b9d 2028 6b61 7473 7529 206d 6561 ... (katsu) mea
0000020: 6e69 6e67 2022 7669 6374 6f72 7922 0d0a ning "victory"..
The first three bytes are the UTF-8 byte order mark (BOM) that Microsoft tools like to add; the rest is just like ASCII except for the second, third and fourth bytes on the second line (e5 8b 9d), which must represent the non-ASCII character somehow.
There are two steps to make Vim successfully display a UTF-8 character:
File encoding. You've correctly identified that this is controlled by the 'encoding' and 'fileencodings' options. Once you've properly set this up (which you can verify via :setlocal fileencoding?, or the ga command on a known character, or at least by checking that each character is represented by a single cell, not its constituent byte values), there's:
Character display. That is, you need to use a font that contains the glyphs for your characters. Unicode is large; most fonts don't contain glyphs for all of it. In my experience, that's less of a problem on Linux, which seems to have some automatic fallbacks built in. But on Windows, you need to have a proper font installed and configured (gVim: in 'guifont').
For example, to properly display Japanese Kanji characters, you need to install the far eastern language support in Windows, and then
:set guifont=MS_Gothic:h12:cSHIFTJIS
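To check that the encoding step succeeded independently of the font (a quick sketch using built-in commands), put the cursor on the 勝 character and run:
:setlocal fileencoding?
ga
The first should report fileencoding=utf-8; ga prints the character's code point (hex 52dd for 勝). If both look right, the remaining black square is purely a font problem.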

(VIM) Is vimgrep capable of searching unicode strings

Is vimgrep capable of searching unicode strings?
For example:
a.txt contains the wide string "hello"; vimgrep hello *.txt finds nothing, and it is definitely in the right path.
"Unicode" is a bit misleading in this case. What you have is not at all typical of text "encoded in accordance with any of the methods provided by the Unicode standard". It's a bunch of normal characters with normal code points, separated by NULL characters with code point 0000. Some Java programs do output that kind of garbage.
So, if your search pattern is hello, Vim and :vim are perfectly capable of searching for and finding hello (without NULLs) but they won't ever find hello (with NULLs).
Searching for h^@e^@l^@l^@o (^@ is entered with <C-v><C-@>), on the other hand, will find hello (with NULLs) but not hello (without NULLs).
Anyway, converting that file/buffer, or making sure you don't end up with such garbage in the first place, are much better long-term solutions.
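A sketch of that conversion (assuming the file really is little-endian UTF-16, as the next answer suggests):
:edit ++enc=utf-16le a.txt
:setlocal fileencoding=utf-8
:write
After that the buffer contains a plain hello, and :vim finds it normally.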
If Vim can detect the encoding of the file, then yes, Vim can grep the file. :vimgrep works by first reading in the file as normal (even including autocmds) into a hidden buffer, and then searching the buffer.
It looks like your file is little-endian UTF-16, without a byte-order mark (BOM). Vim can detect this, but won't by default.
First, make sure your Vim is running with internal support for unicode. To do that, :set encoding=utf-8 at the top of your .vimrc. Next, Vim needs to be able to detect this file's encoding. The 'fileencodings' option controls this.
By default, when you set 'encoding' to utf-8, Vim's 'fileencodings' option contains "ucs-bom" which will detect UTF-16, but ONLY if a BOM is present. To also detect it when no BOM is present, you need to add your desired encoding to 'fileencodings'. It needs to come before any of the 8-bit encodings but after ucs-bom. Try putting this at the top of your .vimrc and restart Vim:
set encoding=utf-8
set fileencodings=ucs-bom,utf-16le,utf-8,default,latin1
Now loading files with the desired encoding should work just fine for editing, and therefore also for vimgrep.
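From there, a plain vimgrep works as expected (a sketch; the file names are from the question):
:vimgrep /hello/ *.txt
:copen
:copen opens the quickfix window listing every match.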

How does vim know the encoding of my files, when even I don't?

I have been in charset hell for days. Vim somehow always shows the right charset for my files, even when I'm not sure what they are (I'm dealing with files with identical content encoded in both charsets, mixed together).
I can see which encoding I'm in by inspecting the ü (u-umlaut) character in UTF-8 vs ISO-8859-1, but I don't understand how vim figures it out: in those character sets only the 'special characters' really look any different.
If the encoding/charset information is recorded somewhere else, I would love to know about it.
The explanation can be found under :help 'fileencodings':
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it. If all fail, 'fileencoding' is set to
an empty string, which means the value of 'encoding' is used.
So, there's no magic involved. When there's a Byte Order Mark in the file, that's easy. Otherwise, Vim tries some other common encodings, which you can influence with that option; e.g. Japanese users will probably include something like sjis if they frequently edit such encoded files.
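For example, a Japanese user might use (a sketch; the order matters, because the first encoding that decodes without error wins, and permissive 8-bit encodings like latin1 never fail, so they must come last):
set fileencodings=ucs-bom,utf-8,sjis,latin1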
If you want a more intelligent detection, there are plugins for that, e.g. AutoFenc - Tries to automatically detect and set file encoding.

How to substitute cp1250-specific characters with utf-8 in Vim

I have some central European characters in cp1250 encoding in Vim. When I change encoding with set encoding=utf-8 they appear like <d0> and such. How can I substitute those characters, over the entire file, for what they should be, i.e. Đ in this case?
As sidyll said, you should really use iconv for the purpose. Iconv knows stuff. It knows all the hairy encodings, obscure code points, katakana, denormalized and canonical forms, compositions, nonspacing characters and the rest.
:%!iconv --from-code cp1250 --to-code utf-8
or shorter
:%!iconv -f cp1250 -t utf-8
to filter the whole buffer. If you do
:he xxd
you'll get a sample of how to automatically encode on buffer load/save, if you want that.
iconv -l will list all the encodings it accepts/knows about (many: 1168 on my system).
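If you only want to convert part of the buffer, the same filter takes a range (a sketch; '<,'> is the last visual selection):
:'<,'>!iconv -f cp1250 -t utf-8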
Happy hacking!
The iconv() function may be useful:
iconv({expr}, {from}, {to}) *iconv()*
The result is a String, which is the text {expr} converted
from encoding {from} to encoding {to}.
When the conversion fails an empty string is returned.
The encoding names are whatever the iconv() library function
can accept, see ":!man 3 iconv".
Most conversions require Vim to be compiled with the |+iconv|
feature. Otherwise only UTF-8 to latin1 conversion and back
can be done.
This can be used to display messages with special characters,
no matter what 'encoding' is set to. Write the message in
UTF-8 and use:
echo iconv(utf8_str, "utf-8", &enc)
Note that Vim uses UTF-8 for all Unicode encodings, conversion
from/to UCS-2 is automatically changed to use UTF-8. You
cannot use UCS-2 in a string anyway, because of the NUL bytes.
{only available when compiled with the +multi_byte feature}
Alternatively, you can set encoding to the value of your file's encoding and termencoding to utf-8. See the Vim mbyte documentation (:help mbyte.txt).
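A sketch of that approach (the encoding=utf-8 route from the answers above is generally preferable):
set termencoding=utf-8
set encoding=cp1250
This keeps the buffer in the file's own encoding while still talking to a UTF-8 terminal.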
