How to set up vim properly for editing in utf-8 - vim

I've run into problems a few times because vim's encoding was set to latin1 by default and I didn't notice and assumed it was using utf-8. Now that I have, I'd like to set up vim so that it will do the right thing in all obvious cases, and use utf-8 by default.
What I'd like to avoid:
Forcing a file saved in some other encoding that would have worked before my changes to open as utf-8, resulting in gibberish.
Forcing a terminal that doesn't support multibyte characters (like the Windows XP one) to try to display them anyway, resulting in gibberish.
Interfering with other programs' ability to read or edit the files (I have a (perhaps unjustified) aversion to using a BOM by default because I am unclear on how likely it is to mess other programs up.)
Other issues that I don't know enough about to guess at (but hopefully you do!)
What I've got so far:
if has("multi_byte")
  if &termencoding == ""
    let &termencoding = &encoding
  endif
  set encoding=utf-8                      " better default than latin1
  setglobal fileencoding=utf-8            " change default file encoding when writing new files
  "setglobal bomb                         " use a BOM when writing new files
  set fileencodings=ucs-bom,utf-8,latin1  " order to check for encodings when reading files
endif
This is taken and slightly modified from the vim wiki. I moved the bomb from setglobal fileencoding to its own statement because otherwise it doesn't actually work. I also commented out that line because of my uncertainty towards BOMs.
What I'm looking for:
Possible pitfalls to avoid that I missed
Problems with the existing code
Links to anywhere this has been discussed / set out already
Ultimately, I'd like this to result in a no-thought-required copy/paste snippet that will set up vim for utf-8-by-default that will work across platforms.
EDIT: I've marked my own answer as accepted for now, as far as I can tell it works okay and accounts for all things it can reasonably account for. But it's not set in stone; if you have any new information please feel free to answer!

In response to sehe, I'll give a go at answering my own question! I removed the updates I made to the original question and have moved them to this answer. This is probably the better way to do it.
The answer:
if has("multi_byte")
  if &termencoding == ""
    let &termencoding = &encoding
  endif
  set encoding=utf-8            " better default than latin1
  setglobal fileencoding=utf-8  " change default file encoding when writing new files
endif
I removed the bomb line because according to the BOM Wikipedia page it is not needed when using utf-8 and in fact defeats ASCII backwards compatibility. As long as ucs-bom is first in fileencodings, vim will be able to detect and handle existing files with BOMs, so it is not needed for that either.
I removed the fileencodings line because it is not needed in this case. From the Vim docs: When 'encoding' is set to a Unicode encoding, and 'fileencodings' was not set yet, the default for 'fileencodings' is changed.
I am using setglobal fileencoding (as opposed to set fileencoding) because:
When reading a file, 'fileencoding' is set automatically based on 'fileencodings', so the global default only matters for new files. And, according to the docs again: "For a new file the global value of 'fileencoding' is used."
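The ASCII-compatibility point about the BOM is easy to check from a shell. This sketch assumes a Unix-like system with the coreutils od tool; the octal escapes are just a portable way to write the BOM bytes:

```shell
# UTF-8 encodes ASCII text unchanged: "abc" stays the bytes 61 62 63.
printf 'abc' | od -An -tx1
# -> 61 62 63

# A UTF-8 BOM prepends the three bytes EF BB BF (octal 357 273 277),
# breaking byte-for-byte ASCII compatibility for BOM-unaware tools.
{ printf '\357\273\277'; printf 'abc'; } | od -An -tx1
# -> ef bb bf 61 62 63
```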

I think it would suffice to have a vanilla vimrc + fenc=utf-8
The rest should be pretty decent out-of-the-box
I'd use the BOM only on Windows platforms with Microsoft tooling (although even some of those fail to always write a BOM; still, it is the default for Notepad's Unicode saving, .NET's XmlWriter, and other central pieces of the MS platform tools)

Related

How do I change vim's character set for stdin?

echo "UTF-16le text"|vim -
:set encoding=utf-16le
:set fileencoding=utf-16le
:e! ++enc=utf-16le
all have absolutely no effect on the mojibake displayed on the screen, and the last one (:e! ++enc=utf-16le) fails outright with E32: No file name.
If I edit ~/.vimrc to set fileencodings=utf-16le,[...] then it works, but I shouldn't have to edit my configuration file every time I use vim, is there a better way? Preferably a way in which a key code will just cycle between my :set fileencodings, that way I can choose quickly if needed.
The command-line equivalent of ~/.vimrc is passing commands via --cmd. You can also employ :set^= (see :help :set^=) to prepend a value to an option:
echo "UTF-16le text"|vim --cmd 'set fencs^=utf-16le' -
I shouldn't have to edit my configuration file every time I use vim
First, I would test whether permanently keeping utf-16le in 'fileencodings' has any negative consequences for any files you regularly edit; maybe you can safely keep it in by default.
Second, there are plugins like AutoFenc, which extends the built-in detection, and fencview, which lets you choose the encoding from a menu.
Alternative
The problem with UTF-16 encodings is well known, and the byte order mark is one solution to make it easy to detect those. With such a BOM, Vim will correctly detect the encoding out-of-the-box. If your input is missing the BOM, you can manually prepend it:
{ printf '\xFF\xFE'; echo "UTF-16le text"; } | vim -
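One can verify that the prepended bytes really form a UTF-16LE stream before handing it to Vim. A quick check, assuming a Unix od (octal escapes stand in for the hex bytes):

```shell
# BOM FF FE (octal 377 376) followed by "hi" in UTF-16LE:
# each ASCII character is its own byte plus a NUL high byte.
{ printf '\377\376'; printf 'h\0i\0'; } | od -An -tx1
# -> ff fe 68 00 69 00
```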

(VIM) Is vimgrep capable of searching unicode string

Is vimgrep capable of searching unicode strings?
For example:
a.txt contains the wide string "hello"; vimgrep hello *.txt finds nothing, and the file is certainly in the search path.
"Unicode" is a bit misleading in this case. What you have is not at all typical of text "encoded in accordance with any of the method provided by the Unicode standard". It's a bunch of normal characters with normal code points separated with NULL characters with code point 0000 or 00. Some Java programs do output that kind of garbage.
So, if your search pattern is hello, Vim and :vim are perfectly capable of searching for and finding hello (without NULLs) but they won't ever find hello (with NULLs).
Searching for h^@e^@l^@l^@o (where ^@ is the NULL character, entered with <C-v><C-@>), on the other hand, will find hello (with NULLs) but not hello (without NULLs).
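The byte pattern described above is exactly what UTF-16LE produces for ASCII text; iconv makes the interleaved NULLs visible. A sketch assuming a Unix-like system with iconv and od available:

```shell
# Encode "hello" as UTF-16LE and dump the bytes: every other byte
# is 00, which Vim shows as ^@ when the file is read as 8-bit text.
printf 'hello' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
# -> 68 00 65 00 6c 00 6c 00 6f 00
```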
Anyway, converting that file/buffer, or making sure you don't end up with such garbage in the first place, are much better long-term solutions.
If Vim can detect the encoding of the file, then yes, Vim can grep the file. :vimgrep works by first reading in the file as normal (even including autocmds) into a hidden buffer, and then searching the buffer.
It looks like your file is little-endian UTF-16, without a byte-order mark (BOM). Vim can detect this, but won't by default.
First, make sure your Vim is running with internal support for unicode. To do that, :set encoding=utf-8 at the top of your .vimrc. Next, Vim needs to be able to detect this file's encoding. The 'fileencodings' option controls this.
By default, when you set 'encoding' to utf-8, Vim's 'fileencodings' option contains "ucs-bom" which will detect UTF-16, but ONLY if a BOM is present. To also detect it when no BOM is present, you need to add your desired encoding to 'fileencodings'. It needs to come before any of the 8-bit encodings but after ucs-bom. Try doing this at the top of your .vimrc and restart Vim to use:
set encoding=utf-8
set fileencodings=ucs-bom,utf-16le,utf-8,default,latin1
Now loading files with the desired encoding should work just fine for editing, and therefore also for vimgrep.

How does vim know encoding of my files? When even I don't

I have been in charset-hell for days and vim somehow always shows the right charset for my file when even I'm not sure what they are (I'm dealing with files with identical content encoded in both charsets, mixed together)
I can see from inspecting the ü (u-umlaut) character in UTF-8 vs ISO-8859-1 which encoding I'm in, but I don't understand how vim figured it out - in those character sets only the 'special characters' really look any different
If there is some other recording of the encoding/charset information I would love to know it
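For reference, the two encodings of ü differ at the byte level, which is what detection keys on. A quick comparison, assuming a UTF-8 shell with iconv and od available:

```shell
# ü is a single byte (FC) in ISO-8859-1 ...
printf 'ü' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1
# -> fc

# ... but a two-byte sequence (C3 BC) in UTF-8.
printf 'ü' | od -An -tx1
# -> c3 bc
```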
The explanation can be found under :help 'fileencodings':
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it. If all fail, 'fileencoding' is set to
an empty string, which means the value of 'encoding' is used.
So, there's no magic involved. When there's a Byte Order Mark in the file, that's easy. Else, Vim tries some other common encodings (which you can influence with that option; e.g. Japanese people will probably include something like sjis if they frequently edit such encoded files).
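Vim's try-in-order strategy can be imitated from the shell: decoding bytes that are invalid in UTF-8 fails, which is exactly the "error detected, try the next one" signal. A sketch assuming GNU iconv, which exits non-zero on invalid input:

```shell
# 0xFC (octal 374) is a valid ISO-8859-1 byte (ü) but an invalid
# UTF-8 sequence, so a strict UTF-8 decode of it fails -- the cue
# to fall back to the next entry in 'fileencodings'.
if printf '\374' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not UTF-8, try the next encoding"
fi
# -> not UTF-8, try the next encoding
```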
If you want a more intelligent detection, there are plugins for that, e.g. AutoFenc - Tries to automatically detect and set file encoding.

Opening UCS-2le File With Vim on Windows

I'm using Vim 7.3 on WinXP. I use XML files that are generated by an application at my work which writes them with UCS-2le encoding. After reading several articles on encoding at the vim wiki I found the following advice given, namely to set my file encoding in vimrc:
set fileencodings=ucs-bom,utf-8
The file in question has FF FE as its first bytes (confirmed by viewing it in HxD), but Vim doesn't open it properly. I can open my UCS-2le files properly with this in my vimrc:
set fileencodings=ucs-2le,utf-8
But now my UTF-8 files are a mess!
Any advice how to proceed? I typically run Gvim without behave MSwin (if that matters). I use very few plugins. My actual vimrc setting regarding file encodings are:
set encoding=utf-8
set fileencodings=ucs-bom,utf-8,ucs-2le,latin1
The entry for ucs-2le in the third spot seems to make no difference. As I understand it, 'encoding' is the encoding Vim uses internally in its buffers, while 'fileencodings' deals with detecting the encoding of a file when Vim reads and writes it. So, it seems to me that since the file has a byte order mark, ucs-bom as the first entry in 'fileencodings' should catch it. As far as I can tell, though, Vim doesn't recognize that this file uses two bytes per character.
Note: I can/do solve the problem in the meantime by manually setting the file encoding when I open my ucs-2le files:
edit ++enc=ucs-2le
Cheers.
Solved it. I am not sure exactly which change did it, but the fixes noted below now read and write my UCS-2 files perfectly (though for some reason not immediately; perhaps I just restarted Vim). I could reverse the fixes one by one to find the critical change, but here's what I've done (see also my comments on Jul 27 above):
Put the AutoFenc.vim plugin (automatically detects file encoding) in my plugins folder.
Added iconv.dll and new version of libintl.dll to my vim73 folder (Vim.org)
Edited vimrc as below
vimrc now contains (the last bits just make it easier to see what's happening with file encodings by showing the file encoding in the status line):
"use utf-8 by default
set encoding=utf-8
set fileencodings=ucs-bom,utf-8,ucs-2le,latin1
"always show status line
set laststatus=2
"show encoding in status line http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line
if has("statusline")
  set statusline=%<%f\ %h%m%r%=%{\"[\".(&fenc==\"\"?&enc:&fenc).((exists(\"+bomb\")\ &&\ &bomb)?\",B\":\"\").\"]\ \"}%k\ %-14.(%l,%c%V%)\ %P
endif
And all is well.

Using vim+LaTeX with Scandinavian characters

I want to create a lab write-up with LaTeX in Ubuntu; however, my text includes Scandinavian characters and at present I have to type them in using \"a and \"o etc. Is it possible to get the latex compiler to read these special characters when they are typed in as-is? Additionally, I would like vim to "read" Finnish: now when I open a .tex document containing Scandinavian characters, they are not displayed at all in vim. How can I correct this?
For latex, use the inputenc option:
\usepackage[utf8]{inputenc}
Instead of utf8, you may use whatever else fits you, like latin1, as well.
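A minimal document using the package might look like the following sketch (compile with pdflatex; the fontenc line is an optional extra so accented glyphs come out as proper single glyphs, and the file must actually be saved in the encoding you declare):

```latex
\documentclass{article}
% Declare the input encoding of this .tex file itself:
\usepackage[utf8]{inputenc}
% Output font encoding, so accented characters are real glyphs:
\usepackage[T1]{fontenc}

\begin{document}
Ääkköset toimivat: ä, ö, å.
\end{document}
```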
Now the trick is to make your terminal run the same character encoding. It seems that it runs a character/input encoding that doesn't fit your input right now.
For this, refer to the locale settings of your distribution. You can always check them in the terminal by issuing locale. These days, UTF-8 locales are preferred as they can represent every character imaginable. If your terminal's environment is set up correctly, vim should happily work with all your special characters without complaining.
To find out in which encoding Vim thinks the document is, try:
:set enc
To set the encoding to UTF-8, try:
:set enc=utf8
I can't help with vim, but for LaTeX I recommend you check out XeTeX, which is an extension of TeX that is designed to support Unicode input. XeTeX is now part of Texlive, so if you have TeX installed chances are you already have it.
I use the UCS unicode support: http://iamleeg.blogspot.com/2007/10/nice-looking-latex-unicode.html
