Is vimgrep capable of searching Unicode strings? - vim

Is vimgrep capable of searching Unicode strings?
For example:
a.txt contains the wide string "hello". :vimgrep hello *.txt finds nothing, and of course the file is in the right path.

"Unicode" is a bit misleading in this case. What you have is not at all typical of text "encoded in accordance with any of the method provided by the Unicode standard". It's a bunch of normal characters with normal code points separated with NULL characters with code point 0000 or 00. Some Java programs do output that kind of garbage.
So, if your search pattern is hello, Vim and :vim are perfectly capable of searching for and finding hello (without NULLs) but they won't ever find hello (with NULLs).
Searching for h^@e^@l^@l^@o (^@ is <C-v><C-@>), on the other hand, will find hello (with NULLs) but not hello (without NULLs).
Anyway, converting that file/buffer, or making sure you don't end up with such garbage in the first place, are much better long-term solutions.
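For instance, one way to do the conversion from within Vim (a sketch, assuming the file really is little-endian UTF-16 as diagnosed in the next answer; a.txt is the file from the question):
:e ++enc=utf-16le a.txt
:set fileencoding=utf-8
:w
The ++enc override forces the decode for this single read, and writing with 'fileencoding' set to utf-8 re-encodes the file without the NULLs.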

If Vim can detect the encoding of the file, then yes, Vim can grep the file. :vimgrep works by first reading in the file as normal (even including autocmds) into a hidden buffer, and then searching the buffer.
It looks like your file is little-endian UTF-16, without a byte-order mark (BOM). Vim can detect this, but won't by default.
First, make sure your Vim is running with internal support for Unicode. To do that, put set encoding=utf-8 at the top of your .vimrc. Next, Vim needs to be able to detect this file's encoding. The 'fileencodings' option controls this.
By default, when you set 'encoding' to utf-8, Vim's 'fileencodings' option contains "ucs-bom", which will detect UTF-16, but ONLY if a BOM is present. To also detect it when no BOM is present, you need to add your desired encoding to 'fileencodings'. It needs to come before any of the 8-bit encodings but after ucs-bom. Try adding this at the top of your .vimrc and restart Vim:
set encoding=utf-8
set fileencodings=ucs-bom,utf-16le,utf-8,default,latin1
Now loading files with the desired encoding should work just fine for editing, and therefore also for vimgrep.
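To verify that detection worked, open the file and ask Vim what it picked (utf-16le here assumes the diagnosis above; your value may differ):
:e a.txt
:set fileencoding?
If Vim answers fileencoding=utf-16le, the buffer was decoded correctly, and :vimgrep hello *.txt will search the decoded text.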

Related

How to treat multibyte characters simply as a sequence of bytes?

I would like to use Vim with binary files. I run Vim with -b and I have isprint= (empty) and display+=uhex. I am using the following statusline:
%<%f\ %h%m%r%=%o\ (0x%06O)\ \ %3.b\ <%02B>\ %7P
so I get output containing some useful information like the byte offset in the file, the current character in hex, etc. But I'm having trouble with random pieces of data interpreted as multibyte characters, which prevent me from accessing the inner bytes, combine with their surroundings (including Vim's decoration), or display as �.
Of course I have tried opening the files with ++enc=latin1. However, my system's encoding is UTF-8, so what vim supposedly does is convert the file from Latin-1 to UTF-8 internally and display that. This has two problems:
The sequence <c3><ac> displays as Ã¬, rather than ì, but the characters count as two bytes each, so it breaks my %o and counts offsets wrong. This is 2 bytes in the file but apparently 4 bytes in Vim's buffer.
I don't know why my isprint is ignored. Neither of these characters is between 32 and 126, so they should display in hex.
I found the following workaround: I set encoding to latin1, but termencoding to utf-8. This achieves what I want, but breaks other things, like when Vim needs to display status messages ("new file", "changed", etc.) in my language, because it wants to use that encoding for them too and they don't fit. I guess I could run Vim with LC_ALL=C, but it feels like I'm resorting to too many dirty tricks already. Is there a better way, i.e., without having to mess around with encoding?
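Spelled out, the workaround described above amounts to (a sketch; it keeps one buffer character per file byte while still talking UTF-8 to the terminal):
" treat the buffer as latin1, i.e. one byte per character
set encoding=latin1
set termencoding=utf-8
With this, %o and the byte offsets count correctly, at the cost of the message-encoding breakage described above.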

How do I change Vim's character set for stdin?

echo "UTF-16le text"|vim -
:set encoding=utf-16le
:set fileencoding=utf-16le
:e! ++enc=utf-16le
None of these has any effect on the mojibake displayed on the screen, though the last one (:e! ++enc=utf-16le) results in the error E32: No file name.
If I edit ~/.vimrc to set fileencodings=utf-16le,[...] then it works, but I shouldn't have to edit my configuration file every time I use Vim. Is there a better way? Preferably one where a key code just cycles through my 'fileencodings', so I can choose quickly if needed.
The command-line equivalent of ~/.vimrc is passing commands via --cmd. You can also employ :set^= (see :help :set^=) to prepend a value to an option:
echo "UTF-16le text"|vim --cmd 'set fencs^=utf-16le' -
I shouldn't have to edit my configuration file every time I use vim
First, I would test whether permanently keeping utf-16le in 'fileencodings' has any negative consequences for any files you regularly edit; maybe you can safely keep it in by default.
Second, there are plugins like AutoFenc, which extends the built-in detection, and fencview, which lets you choose the encoding from a menu.
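As for cycling with a single key, here is a minimal sketch of such a mapping (entirely hypothetical: the encoding list, the function name and the <F8> key are arbitrary choices, and it needs a file-backed buffer, since :edit on a nameless stdin buffer raises E32 as shown above):
" cycle the current buffer through a list of candidate encodings
let g:enc_list = ['utf-8', 'utf-16le', 'latin1']
let g:enc_idx = 0
function! CycleEncoding() abort
  let g:enc_idx = (g:enc_idx + 1) % len(g:enc_list)
  execute 'edit! ++enc=' . g:enc_list[g:enc_idx]
  echo 'reloaded with ++enc=' . g:enc_list[g:enc_idx]
endfunction
nnoremap <F8> :call CycleEncoding()<CR>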
Alternative
The problem with UTF-16 encodings is well known, and the byte-order mark is one solution that makes them easy to detect. With such a BOM, Vim will correctly detect the encoding out of the box. If your input is missing the BOM, you can manually prepend it:
{ printf '\xFF\xFE'; echo "UTF-16le text"; } | vim -

How to display UTF-8 characters in Vim correctly

I want/need to edit files with UTF-8 characters in them, and I want to use Vim for it.
Before I get accused of asking something that was asked before, I've read the Vim documentation on encoding, fileencoding[s], termencoding and more, googled the subject, and read this question among other texts.
Here is a sentence with a UTF-8 character in it that I use as a test case.
From Japanese 勝 (katsu) meaning "victory"
If I open the (UTF-8) file with Notepad it is displayed correctly.
When I open it with Vim, the best thing I get is a black square where the Japanese character for katsu should be.
Changing any of the settings for fileencoding or encoding does not make a difference.
Why is Vim giving me a black square where Notepad displays it without problems? If I copy the text from Vim and paste it into Notepad, it is displayed correctly, indicating that the text is not corrupted but displayed wrong. But what setting(s) influence that?
Here is the relevant part of my _vimrc:
if has("multi_byte")
  set encoding=utf-8
  if &termencoding == ""
    let &termencoding = &encoding
  endif
  setglobal fileencoding=utf-8
  set fileencodings=ucs-bom,utf-8,latin1
endif
The actual settings when I open the file are:
encoding=utf-8
fileencoding=utf-8
termencoding=utf-8
My PC is running Windows 10, language is English (United States).
This is what the content of the file looks like after loading it in Vim and converting it to hex:
0000000: efbb bf46 726f 6d20 4a61 7061 6e65 7365 ...From Japanese
0000010: 20e5 8b9d 2028 6b61 7473 7529 206d 6561 ... (katsu) mea
0000020: 6e69 6e67 2022 7669 6374 6f72 7922 0d0a ning "victory"..
The first three bytes are the Microsoft-style UTF-8 BOM magic; the rest is just like ASCII except for the second, third and fourth bytes on the second line (e5 8b 9d), which must represent the non-ASCII character somehow.
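Indeed they do: e5 8b 9d is a single three-byte UTF-8 sequence. In binary it is 11100101 10001011 10011101; stripping the UTF-8 marker bits (1110xxxx 10yyyyyy 10zzzzzz) leaves 0101 001011 011101 = 0x52DD, i.e. U+52DD, which is precisely 勝 (katsu).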
There are two steps to make Vim successfully display a UTF-8 character:
1. File encoding. You've correctly identified that this is controlled by the 'encoding' and 'fileencodings' options. Once you've properly set this up (which you can verify via :setlocal fileencoding?, or the ga command on a known character, or at least by checking that each character is represented by a single cell, not its constituent byte values), there's:
2. Character display. That is, you need to use a font that contains the UTF-8 glyphs. UTF-8 is large; most fonts don't contain all glyphs. In my experience, that's less of a problem on Linux, which seems to have some automatic fallbacks built in. But on Windows, you need to have a proper font installed and configured (for gVim: in 'guifont').
For example, to properly display Japanese Kanji characters, you need to install the far eastern language support in Windows, and then
:set guifont=MS_Gothic:h12:cSHIFTJIS
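As a quick check for step 1 (the numbers below assume the 勝 character from the test sentence above): put the cursor on the character and press ga. With correct decoding, Vim reports a single code point, something like
<勝> 21213, Hex 52dd, Octal 51335
whereas a wrongly decoded buffer shows separate one-byte values instead.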

How does Vim know the encoding of my files, when even I don't?

I have been in charset hell for days, and Vim somehow always shows the right charset for my files, even when I'm not sure what they are (I'm dealing with files with identical content encoded in both charsets, mixed together).
I can see from inspecting the ü (u-umlaut) character in UTF-8 vs ISO-8859-1 which encoding I'm in, but I don't understand how Vim figured it out; in those character sets, only the "special characters" really look any different.
If there is some other recording of the encoding/charset information, I would love to know it.
The explanation can be found under :help 'fileencodings':
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it. If all fail, 'fileencoding' is set to
an empty string, which means the value of 'encoding' is used.
So, there's no magic involved. When there's a byte order mark in the file, that's easy. Otherwise, Vim tries some other common encodings, which you can influence with that option; e.g. Japanese users will probably include something like sjis if they frequently edit such encoded files.
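For example, such a user might add something like this (one plausible arrangement; each entry is tried in turn, and the 8-bit latin1 fallback at the end never fails):
set fileencodings=ucs-bom,utf-8,sjis,latin1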
If you want a more intelligent detection, there are plugins for that, e.g. AutoFenc - Tries to automatically detect and set file encoding.

Using vim+LaTeX with Scandinavian characters

I want to create a lab write-up with LaTeX in Ubuntu; however, my text includes Scandinavian characters and at present I have to type them in using \"a, \"o, etc. Is it possible to get the LaTeX compiler to read these special characters when they are typed in as is? Additionally, I would like Vim to "read" Finnish: right now, when I open a .tex document containing Scandinavian characters, they are not displayed at all in Vim. How can I correct this?
For LaTeX, use the inputenc package:
\usepackage[utf8]{inputenc}
Instead of utf8, you may use whatever else fits you, such as latin1.
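A minimal sketch that compiles when the .tex file itself is saved as UTF-8 (the fontenc line is an extra, commonly recommended companion, not something mentioned above):
\documentclass{article}
\usepackage[utf8]{inputenc} % interpret the source bytes as UTF-8
\usepackage[T1]{fontenc}    % fonts that actually contain the accented glyphs
\begin{document}
Hyvää päivää! Smörgåsbord.
\end{document}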
Now the trick is to make your terminal run the same character encoding. It seems that it currently runs a character/input encoding that doesn't fit your input.
For this, refer to the "Locale" settings of your distribution. You can always check the locale settings in the terminal by issuing locale. These days, UTF-8 locales are preferred as they work with every character imaginable. If your terminal's environment is set up correctly, Vim should happily work with all your special characters without mourning.
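For example (en_US.UTF-8 is just an illustrative value; pick a UTF-8 locale that locale -a actually lists on your system):
locale                    # inspect the current locale settings
export LANG=en_US.UTF-8   # switch this shell session to a UTF-8 locale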
To find out in which encoding Vim thinks the document is, try:
:set enc
To set the encoding to UTF-8, try:
:set enc=utf8
I can't help with Vim, but for LaTeX I recommend you check out XeTeX, which is an extension of TeX designed to support Unicode input. XeTeX is now part of TeX Live, so if you have TeX installed, chances are you already have it.
I use the UCS unicode support: http://iamleeg.blogspot.com/2007/10/nice-looking-latex-unicode.html
