How to display UTF-8 characters in Vim correctly - vim

I want/need to edit files with UTF-8 characters in it and I want to use Vim for it.
Before I get accused of asking something that was asked before, I've read the Vim documentation on encoding, fileencoding[s], termencoding and more, googled the subject, and read this question among other texts.
Here is a sentence with a UTF-8 character in it that I use as a test case.
From Japanese 勝 (katsu) meaning "victory"
If I open the (UTF-8) file with Notepad it is displayed correct.
When I open it with Vim, the best thing I get is a black square where the Japanese character for katsu should be.
Changing any of the settings for fileencoding or encoding does not make a difference.
Why is Vim giving me a black square where Notepad is displaying it without problems? If I copy the text from Vim with copy/paste to Notepad it is displayed correctly, indicating that the text is not corrupted but displayed wrong. But what setting(s) have influence on that?
Here is the relevant part of my _vimrc:
if has("multi_byte")
set encoding=utf-8
if &termencoding == ""
let &termencoding = &encoding
endif
setglobal fileencoding=utf-8
set fileencodings=ucs-bom,utf-8,latin1
endif
The actual settings when I open the file are:
encoding=utf-8
fileencoding=utf-8
termencoding=utf-8
My PC is running Windows 10, language is English (United States).
This is what the content of the file looks like after loading it in Vim and converting it to hex:
0000000: efbb bf46 726f 6d20 4a61 7061 6e65 7365 ...From Japanese
0000010: 20e5 8b9d 2028 6b61 7473 7529 206d 6561 ... (katsu) mea
0000020: 6e69 6e67 2022 7669 6374 6f72 7922 0d0a ning "victory"..
The first to bytes is the Microsoft BOM magic, the rest is just like ASCII except for the second, third and fourth byte on the second line, which must represent the non-ASCII character somehow.

There are two steps to make Vim successfully display a UTF-8 character:
File encoding. You've correctly identified that this is controlled by the 'encoding' and 'fileencodings' options. Once you've properly set this up (which you can verify via :setlocal filenencoding?, or the ga command on a known character, or at least by checking that each character is represented by a single cell, not its constituent byte values), there's:
Character display. That is, you need to use a font that contains the UTF-8 glyphs. UTF-8 is large; most fonts don't contain all glyphs. In my experience, that's less of a problem on Linux, which seems to have some automatic fallbacks built in. But on Windows, you need to have a proper font installed and configured (gVim: in guifont).
For example, to properly display Japanese Kanji characters, you need to install the far eastern language support in Windows, and then
:set guifont=MS_Gothic:h12:cSHIFTJIS

Related

Editing in binary

I've been tinkering with multiple hex editors but nothing really has worked.
What I'm looking for is a way to change a binary in actual binary (not in hex). This is purely for educational purposes and I know it's trivial to convert between both, but I wanted to be able to change the ones and zeroes just like I would do hex.
I've tried using vim with the %!xxd -b but then it won't work with %!xxd -r. I know how to convert the file into binary, but I'm looking for a way to dynamically change it in this format and being able to save it.
Better yet would be if I could find a way to actually create a binary by coding purely in actual binary.
Any help would be appreciated :D
vim or gvim should work for you directly, without the xxd filter.
Open the file in (g)vim. Place your cursor on a character and type ga to see its character code in the status line. To insert character NNN, place your cursor where you want it, go in insert mode and type Ctrl-v and then the three digit decimal code value. Use Ctrl-v x HH to enter the character by its hexadecimal code.
Make sure your terminal is not set to use UTF8, because in UTF8, typing Ctrl-v 128 will in fact insert c280, the utf-8 encoding of character 128, instead of 80.
LC_ALL=C vim binary-file
is the easiest way to make sure you're doing binary character based editing in vim, but that might do weird things if your terminal is utf-8.
LC_ALL=C gvim binary-file
should open a stand-alone window with proper display.
FYI, if you did want to work in utf-8, Ctrl-v u HHHH is how to enter the Unicode character with Hex code point HHHH.
windows
open cmd.exe or notepad++ or whatever editor
enable numlock key
On laptops you need to use the function key or the blue / grey silver numbers above alphabet keys (using the numbers on the top line will not work as they map to different scan code.
press alt key + 255 will correspond to 0xff
press alt key + 254 will correspond to 0xfe
see below for a demo
C:\>copy con rawbin.bin
 ■²ⁿ√·∙⌂~}─^Z
^Z
1 file(s) copied.
C:\>xxd rawbin.bin
0000000: fffe fdfc fbfa f97f 7e7d c41a 0d0a ........~}....
C:\>

How to insert Unicode character U+2611 in gvim

when I try to enter this Unicode character :☑(U+2611) in vim using the command like : ^Vu2611 (which means press ctrl+V then type u2611 in insert mode),Vim somehow breaks it into two characters : &(26) and ^Q(11).
There's no any problem when I tried to insert other kind of characters like □ (U+a1f5).
It seems like Vim stopped its parsing immediately after 26 (which represents character '&') has been read .
So,how can I insert this kind of Unicode characters in Vim (I have tried to paste it into Vim ,it doesn't work)?
Please Help!!!
In order to process Unicode characters, Vim must use an 'encoding' that is able to represent those characters. With a value of latin1, the mentioned character cannot be encoded (this 8-bit encoding only includes ASCII and several Western European characters, see here).
So, you need to
:set encoding=utf-8
With that, any newly created file will use that encoding, and you should be able to insert Unicode characters and write them (also with another Unicode file encoding, like :w ++enc=ucs-2le; but if you tried to persist as :w ++enc=latin1, you'd get a CONVERSION ERROR).

vim doesn't display cp1252 characters

I'm running gvim, under Windows. I pasted some text from a web page but vim does not display the hyphen and smart quotes.
When I check the encoding that vim is using (:set enc) vim reports that it is using cp1252.
When I check the hex value of the codes under the cursor (ga) vim reports the correct cp1252 code values (0x96, 0x93, and 0x94).
And yet it does display the smart single quotes (0x91 and 0x92)
Can anyone explain what is happening?
Thanks,
Steve
The "DejaVu_Sans_Mono" and "Lucinda_Console" fonts both displayed the characters with codes between 0xA0 and 0x9F correctly. There are probably others that do so as well. I'm using gvim, so I selected them from the menubar (Edit/Select Font). Subsequently I added the line "set guifont=DejaVu_Sans_Mono:h12" to my _vimrc file. The "h12" specifies font size. – justerman

(VIM) Is vimgrep capable of searching unicode string

Is vimgrep capable of searching unicode strings?
For example:
a.txt contains wide string "hello", vimgrep hello *.txt found nothing, and of course it's in the right path.
"Unicode" is a bit misleading in this case. What you have is not at all typical of text "encoded in accordance with any of the method provided by the Unicode standard". It's a bunch of normal characters with normal code points separated with NULL characters with code point 0000 or 00. Some Java programs do output that kind of garbage.
So, if your search pattern is hello, Vim and :vim are perfectly capable of searching for and finding hello (without NULLs) but they won't ever find hello (with NULLs).
Searching for h^#e^#l^#l^#o (^# is <C-v><C-#>), on the other hand, will find hello (with NULLs) but not hello (without NULLs).
Anyway, converting that file/buffer or making sure you don't end up with such a garbage are much better long-term solutions.
If Vim can detect the encoding of the file, then yes, Vim can grep the file. :vimgrep works by first reading in the file as normal (even including autocmds) into a hidden buffer, and then searching the buffer.
It looks like your file is little-endian UTF-16, without a byte-order mark (BOM). Vim can detect this, but won't by default.
First, make sure your Vim is running with internal support for unicode. To do that, :set encoding=utf-8 at the top of your .vimrc. Next, Vim needs to be able to detect this file's encoding. The 'fileencodings' option controls this.
By default, when you set 'encoding' to utf-8, Vim's 'fileencodings' option contains "ucs-bom" which will detect UTF-16, but ONLY if a BOM is present. To also detect it when no BOM is present, you need to add your desired encoding to 'fileencodings'. It needs to come before any of the 8-bit encodings but after ucs-bom. Try doing this at the top of your .vimrc and restart Vim to use:
set encoding=utf-8
set fileencodings=ucs-bom,utf-16le,utf-8,default,latin1
Now loading files with the desired encoding should work just fine for editing, and therefore also for vimgrep.

Vim UTF-8 encoding error on Windows

I have a text file with Polish characters. As long as I do not set :set encoding=utf-8 the characters are not displayed correctly. As soon as I set it to Unicode the characters are displayed but umlauts in error messages in Vim on the other hand are not displayed anymore.
Example:
E37: Kein Schreibvorgang seit der letzten <c4>nderung (erzwinge mit !)
Instead of the <c4> there should be the character Ä displayed. Can anybody explain me why this happens?
I'm experiencing similar issues (you can view some of the questions in my account info, or search for "central european characters" or "croatian characters").
Changing the encoding value changes the way Vim displays the characters - so, the way some of the characters are displayed is changed - that's why you're getting characters. You could probably solve your problem of Polish characters by choosing some other encoding value (one of the cpXXXX for example instead of utf8), but then you would lose the ability to display utf8 characters which can make Vim rather pretty. At least this works for my case (Croatian).
So, either use while writing polish texts one of the cpXXXX encoding values, or stick to utf8 completely. I recommend the first one. But do not change them.
Still working on that here.

Resources