How to substitute cp1250 specific characters to utf-8 in Vim

I have some central european characters in cp1250 encoding in Vim. When I change encoding with set encoding=utf-8 they appear like <d0> and such. How can I substitute over the entire file those characters for what they should be, i.e. Đ, in this case?

As sidyll said, you should really use iconv for the purpose. Iconv knows stuff. It knows all the hairy encodings, obscure code-points, katakana, denormalized, canonical forms, compositions, nonspacing characters and the rest.
:%!iconv --from-code cp1250 --to-code utf-8
or shorter
:%!iconv -f cp1250 -t utf-8
to filter the whole buffer. If you do
:he xxd
you'll get a sample of how to convert automatically on buffer load/save, if you want that.
iconv -l lists all the encodings it accepts/knows about (many: 1168 on my system).
Happy hacking!
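To see what that filter does to a single byte, here is a minimal Python sketch. Python is not part of the original answer, and the function name cp1250_to_utf8 is made up for illustration. It assumes the buffer really does hold cp1250 bytes, as in the question.

```python
def cp1250_to_utf8(raw: bytes) -> bytes:
    """Reinterpret cp1250 bytes as text, then re-encode the text as UTF-8."""
    text = raw.decode("cp1250")   # byte 0xD0 is the letter Đ (U+0110) in cp1250
    return text.encode("utf-8")


# The byte Vim displayed as <d0> needs two bytes once it is stored as UTF-8.
converted = cp1250_to_utf8(b"\xd0")
print(converted.hex())             # c490
print(converted.decode("utf-8"))   # Đ
```

Plain ASCII bytes pass through unchanged, which is why only the accented characters look wrong before the conversion.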

The iconv() function may be useful:
iconv({expr}, {from}, {to}) *iconv()*
The result is a String, which is the text {expr} converted
from encoding {from} to encoding {to}.
When the conversion fails an empty string is returned.
The encoding names are whatever the iconv() library function
can accept, see ":!man 3 iconv".
Most conversions require Vim to be compiled with the |+iconv|
feature. Otherwise only UTF-8 to latin1 conversion and back
can be done.
This can be used to display messages with special characters,
no matter what 'encoding' is set to. Write the message in
UTF-8 and use:
echo iconv(utf8_str, "utf-8", &enc)
Note that Vim uses UTF-8 for all Unicode encodings, conversion
from/to UCS-2 is automatically changed to use UTF-8. You
cannot use UCS-2 in a string anyway, because of the NUL bytes.
{only available when compiled with the +multi_byte feature}

You can set encoding to the value of your file's encoding and termencoding to UTF-8. See the Vim mbyte documentation.

Related

Confirming the encoding of a file

I am outputting a file from SSIS in UTF-8 Encoding.
This file is passed to a third party for import into their system.
They are having a problem importing this file. Although they requested UTF-8 encoding, it seems they convert the encoding to ISO-8859-1. They use this command to convert the file's encoding:
iconv -f UTF-8 -t ISO-8859-1 dweyr.inp
They are receiving this error
illegal input sequence at position 11
The piece of text causing the issue is:
ark O’Dwy
I think it's the apostrophe, or whatever version of an apostrophe is used in this text.
The problem I face is that every text editor I try tells me the file is UTF-8 and renders it correctly.
The vendor is saying that this char is not UTF-8.
How can I confirm who is correct?
The error message by iconv is a bit misleading, but kind-of correct.
It doesn't tell you that the input isn't valid UTF-8, but that it cannot be converted to ISO-8859-1 in a lossless way. ISO-8859-1 does not have a way to encode the ’ character.
Verify that by executing this command:
echo "ark O’Dwy" | iconv -f UTF-8 -t UTF-7
This produces the output that looks like "ark O+IBk-Dwy".
Here I'm outputting to UTF-7 (a very rarely used encoding that is useful for demonstration here, but little else).
In other words: the encoding is only "illegal" in the sense that it cannot be converted to ISO-8859-1, but it's a perfectly valid UTF-8 sequence.
If the third party claims to support UTF-8, then they may do so only very superficially. They might support any text that can be encoded in ISO-8859-1 as long as it's encoded in UTF-8 (which is an extremely low level of "UTF-8 support").
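The same check can be sketched in Python, as a rough cross-check rather than a replacement for the iconv command above. The helper name first_unencodable is hypothetical, and the sample assumes the character in the file is U+2019, which is what the UTF-7 output +IBk- indicates.

```python
def first_unencodable(text: str, encoding: str):
    """Return (index, character) of the first character that `encoding`
    cannot represent, or None if the whole text can be encoded."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        return err.start, text[err.start]
    return None


sample = "ark O\u2019Dwy"  # U+2019 RIGHT SINGLE QUOTATION MARK

# Encoding to UTF-8 succeeds: the quote becomes the three bytes e2 80 99.
print(sample.encode("utf-8"))
print(first_unencodable(sample, "utf-8"))        # None
# ISO-8859-1 has no slot for U+2019, so the conversion is what fails.
print(first_unencodable(sample, "iso-8859-1"))   # (5, '’')
```

The index reported here counts characters within this short sample, so it is not expected to match the position in iconv's message.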

How to treat multibyte characters simply as a sequence of bytes?

I would like to use vim with binary files. I run vim with -b and I have isprint= and display+=uhex. I am using the following statusline:
%<%f\ %h%m%r%=%o\ (0x%06O)\ \ %3.b\ <%02B>\ %7P
so I get output containing some useful information like the byte offset in the file and the current character in hex etc. But I'm having trouble with random pieces of data being interpreted as multibyte characters, which prevents me from accessing the inner bytes; they combine with their surroundings (including vim's decoration) or display as �.
Of course I have tried opening the files with ++enc=latin1. However, my system's encoding is UTF-8, so what vim supposedly does is convert the file from Latin-1 to UTF-8 internally and display that. This has two problems:
The sequence <c3><ac> displays as Ã¬, rather than ì, but the characters count as two bytes each, so it breaks my %o and counts offsets wrong. This is 2 bytes in the file but apparently 4 bytes in vim's buffer.
I don't know why my isprint is ignored. Neither of these characters are between 32 and 126 so they should display in hex.
I found the following workaround: I set encoding to latin1, but termencoding to utf-8. This achieves what I want, but breaks other things like when vim needs to display status messages ("new file", "changed" etc.) in my language, because it wants to use the encoding for them too and they don't fit. I guess I could run vim in LC_ALL=C but it feels I'm resorting to too many dirty tricks already. Is there a better way, i.e., without having to mess around with encoding?
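The byte counts in the first problem can be reproduced outside Vim. This is a small Python sketch of the Latin-1-to-UTF-8 round trip only, under the assumption that the buffer is stored as UTF-8; it says nothing about how Vim computes %o internally.

```python
raw = b"\xc3\xac"                       # two bytes in the file

as_latin1 = raw.decode("latin-1")       # read as Latin-1: two characters
in_buffer = as_latin1.encode("utf-8")   # stored as UTF-8: two bytes per character

print(as_latin1)         # Ã¬
print(len(raw))          # 2
print(len(in_buffer))    # 4
print(in_buffer.hex())   # c383c2ac
```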

How does VIM perform charset conversion?

I see the following paragraph on the vim documentation for the introduction of charset conversion:
Vim will automatically convert from one to another encoding in several places:
- When reading a file and 'fileencoding' is different from 'encoding'
- When writing a file and 'fileencoding' is different from 'encoding'
- When displaying characters and 'termencoding' is different from 'encoding'
- When reading input and 'termencoding' is different from 'encoding'
- When displaying messages and the encoding used for LC_MESSAGES differs from
'encoding' (requires a gettext version that supports this).
- When reading a Vim script where |:scriptencoding| is different from
'encoding'.
- When reading or writing a |viminfo| file.
I want to know which is converted to which. For example:
"When reading a file and 'fileencoding' is different from 'encoding'"
Is 'fileencoding' converted to 'encoding'? Or is 'encoding' converted to 'fileencoding'?
What is the relationship between the actual charset of the file and fileencoding and encoding?
If the actual charset of the file and the value of fileencoding are not equal, will the above conversion operations destroy the contents of the file?
UPDATE:
For example: the value of encoding is utf-8, Vim opens a file foo, and based on fileencodings it picks the fileencoding value sjis (assuming I don't know the actual encoding of this file). I edit foo and use ":wq" to save and close the Vim window. If I open foo again, is the actual encoding of this file the sjis specified by fileencoding, or the utf-8 specified by encoding when I last edited it?
'encoding' is the internal representation of any buffer text inside Vim; this is what Vim is working on. When you're dealing with different character sets (or if you don't care and work on a modern operating system), it's highly recommended to set this to utf-8, as the Unicode encoding ensures that any character can be represented and no information is lost. (And UTF-8 is the only Unicode representation that Vim internally supports; i.e. you cannot make it use a double-byte encoding like UTF-16.)
When you open a file in Vim, the list of possible encodings in 'fileencodings' (note the plural!) is considered:
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it.
So if a file doesn't look right, this is the option to tweak; alternatively, you can explicitly override the detection via the ++enc argument, e.g.
:edit ++enc=sjis japanese.txt
Now, Vim has the file's source encoding (persisted in (singular!) 'fileencoding'; this is needed for writing it back in the original encoding), and converts the character set (if different) to its internal 'encoding'. All Vim commands operate on that, and on :write, the conversion happens in reverse (or optionally overridden by :w ++enc=...).
Conclusions
As long as the detected / passed encoding is right, and assuming the internal 'encoding' is able to represent all read characters (guaranteed with utf-8), there will be no data loss.
Likewise, as the original encoding is stored in 'fileencoding', writes of the file transparently convert back. Now, it could have happened that editing introduced a character that cannot be represented in the file's encoding (but you were able to edit it in because of Vim's internal Unicode encoding). Vim will then print E513: write error, conversion failed on writing, and you have to manually change the character(s), or choose a different target file encoding.
Example
A file with these Kanji characters 日本 is represented as follows in the SJIS encoding:
93fa 967b 0a
Each Kanji is stored in two bytes, and then you have the one-byte newline (LF) at the end.
With :set encoding=utf-8, this is represented internally as (g8 can tell you this):
e697 a5e6 9cac 0a
In UTF-8, each Kanji is stored in three bytes, the first Kanji is e6 97 a5.
Now if I edit the text, e.g. enclosing with (ASCII) parentheses, and :write, I get this:
2893 fa96 7b29 0a
The original SJIS encoding is restored, each Kanji is two bytes again, now with the added parentheses 28 and 29 around it.
Had I tried to edit in a ä character, the :write would have failed with the E513 error, as that character cannot be represented in SJIS.
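The byte sequences in this example can be checked with a short Python sketch. It only mimics the read and write conversions using Python's shift_jis codec, which may differ from the sjis converter Vim uses in edge cases; the constant names FILE_ENCODING and INTERNAL are illustrative stand-ins for 'fileencoding' and 'encoding'.

```python
FILE_ENCODING = "shift_jis"   # stands in for 'fileencoding'
INTERNAL = "utf-8"            # stands in for 'encoding'

on_disk = bytes.fromhex("93fa967b0a")

# Reading: file bytes are converted from the file encoding to the internal one.
text = on_disk.decode(FILE_ENCODING)
print(text.encode(INTERNAL).hex())          # e697a5e69cac0a

# Editing, then writing: the conversion runs in the opposite direction.
edited = "(" + text.rstrip("\n") + ")\n"
print(edited.encode(FILE_ENCODING).hex())   # 2893fa967b290a

# A character with no SJIS representation cannot be written back.
try:
    "ä".encode(FILE_ENCODING)
except UnicodeEncodeError:
    print("ä cannot be represented in SJIS")
```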

Changing default encoding of vim to utf-8 not working

I've been trying to change the default encoding of vim by writing in the $HOME/.vimrc file the following lines
set fileencodings=utf-8
set fileencoding=utf-8
I understand that the first line makes vim try to read a file with UTF-8 encoding and the second line makes vim always save files with UTF-8 encoding. But when I create a file called example.txt, write no special characters in it, and save it, I verify with
file -i example.txt
output:
example.txt: text/plain; charset=us-ascii
If I then write special characters to example.txt, it is reported correctly:
example.txt: text/plain; charset=utf-8
But I want always the encoding to be utf-8 even if the file does not have special characters. Why is it not working?
file looks at the content of the file to determine its encoding. If it only finds ASCII characters it can only conclude that the file is ASCII.
Since ASCII is a subset of UTF-8 (and the basis of a number of other encodings), it is simply impossible for any program to tell whether 123abc is anything other than ASCII.
Of course, if you add UTF-8 characters to that file, file will spot them and act accordingly.
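A minimal Python sketch of why the two cases are indistinguishable. The helper is_pure_ascii is a made-up name, and this is only the underlying idea, not how file is actually implemented.

```python
def is_pure_ascii(data: bytes) -> bool:
    """True when every byte is below 0x80, i.e. the data is plain ASCII."""
    return all(byte < 0x80 for byte in data)


plain = "123abc".encode("utf-8")
accented = "123abcé".encode("utf-8")

# Plain ASCII text yields the same bytes whether saved as ASCII, UTF-8 or Latin-1,
# so nothing in the file records which of them was intended.
print(plain == "123abc".encode("ascii") == "123abc".encode("latin-1"))  # True
print(is_pure_ascii(plain))      # True
# Only a non-ASCII character produces bytes that give the encoding away.
print(is_pure_ascii(accented))   # False
```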
So… on to the Vim side of the "problem".
fileencodings is a list of encodings considered by Vim when reading a file.
fileencoding is the encoding used by Vim when writing a specific file.
The default value of both options depends on the value of encoding, which is set during startup.
The "ideal" encoding is utf-8. With this, you get a sensible list for fileencodings: ucs-bom,utf-8,default,latin1, and a sensible value for fileencoding: utf-8 that pretty much guarantee a smooth experience as long as you stay within the confines of UTF-8.
See:
:help 'encoding'
:help 'fileencodings'
:help 'fileencoding'

Vim's encoding options

Although Vim's help is a treasure cave of information, in some cases I find it mindboggling. Its explanation of different encoding-related options is one such case.
Can someone please explain to me, in simple terms, what do encoding, fileencoding and fileencodings settings do, and how can I
a) view the encoding of the current file?
b) change the encoding of the current file?
c) do something else which is used often, but slips my mind right now?
encoding is used by Vim to know what character sets it supports and how characters are stored internally.
You shouldn't really modify this setting; it should default to something Unicodeish. Otherwise you couldn't read and write files with an extended character set.
Put :set encoding=utf-8 at the start of your vimrc if you are not sure, and never play with that setting again except if you have to read huge files for one session with a 1-byte encoding.
fileencoding stores the encoding of the current buffer.
You might read and write to this variable and it will do what you want.
When you modify it, the file will be marked as modified, and when you save it (:w or :up) to disk, it will be written with the encoding that you specified.
fileencodings tells Vim how to detect the encoding of every file you read (in order to determine the value of fileencoding). It is a list of encodings, that are tried in order, and the first encoding that is consistent with the binary contents of the file is assumed to be the encoding of the file you are reading.
Set it once and then forget it. You might need to change it if you know that you are going to open plenty of files that all use the same encoding and you don't want to lose time checking other encodings. The default, ucs-bom,utf8,latin1, is nice IMO if you are in Western Europe, because almost any file will be opened in the correct encoding. However, with this setting, when you open plain ASCII files (i.e., files whose byte representation is the same in UTF-8 and in any Latin-based code page) the file will be assumed to be UTF-8 and saved as such.
Example: if you set fileencodings to latin1,utf8, every file that you open will be read as latin1 because trying to read a file with latin1 encoding never fails: there is a bijection between the 256 possible byte values and the individual characters in the character set.
Conversely if you try fileencodings=ucs-bom,utf8,latin1 Vim will first check for a byte-order-mark and decode Unicode files with BOM, then if it failed (no BOM) try to read your files in UTF-8, and if it fails (because some byte sequences in UTF8 are invalid) open your file in latin1.
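The "first encoding that decodes without error wins" rule can be sketched in a few lines of Python. This is a simplification under stated assumptions: detect_encoding is a hypothetical helper, the candidates are Python codec names rather than Vim names, and the ucs-bom step is left out.

```python
def detect_encoding(data: bytes, candidates):
    """Return the first candidate that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


utf8_bytes = "ì".encode("utf-8")   # c3 ac, valid UTF-8
lone_byte = b"\xd0"                # not valid UTF-8 on its own

print(detect_encoding(utf8_bytes, ["utf-8", "latin-1"]))   # utf-8
print(detect_encoding(lone_byte, ["utf-8", "latin-1"]))    # latin-1
# With latin-1 first, UTF-8 is never tried, because latin-1 accepts every byte.
print(detect_encoding(utf8_bytes, ["latin-1", "utf-8"]))   # latin-1
```

This is why the order of the list matters: an encoding that can never fail should come last.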
In order to reload a file with proper encoding (case when fileencodings did not work properly) you can do: :e! ++enc=<the_encoding>.
tl;dr:
view the encoding of the current file: :echo &fileencoding (shorter: :echo &fenc or :set fenc? or :verb set fenc?)
change the encoding of the current file: :set fenc=… and then call :w as many times as you want.
reload your file with proper encoding: :e! ++enc=…
encoding:
The internal representation. View or set with:
:set encoding
:set encoding=utf-8
fileencoding:
The representation that will be used when the file is written. View or set with:
:set fileencoding
:set fileencoding=utf-8
fileencodings:
The list of possible encodings that are tested when reading a file. View or set with:
:set fileencodings
:set fileencodings=utf-8,latin-1,cp1251
Here is the list of possible encodings from the vim documentation (mbyte-encoding)
Supported 'encoding' values are: *encoding-values*
1 latin1 8-bit characters (ISO 8859-1, also used for cp1252)
1 iso-8859-n ISO_8859 variant (n = 2 to 15)
1 koi8-r Russian
1 koi8-u Ukrainian
1 macroman MacRoman (Macintosh encoding)
1 8bit-{name} any 8-bit encoding (Vim specific name)
1 cp437 similar to iso-8859-1
1 cp737 similar to iso-8859-7
1 cp775 Baltic
1 cp850 similar to iso-8859-4
1 cp852 similar to iso-8859-1
1 cp855 similar to iso-8859-2
1 cp857 similar to iso-8859-5
1 cp860 similar to iso-8859-9
1 cp861 similar to iso-8859-1
1 cp862 similar to iso-8859-1
1 cp863 similar to iso-8859-8
1 cp865 similar to iso-8859-1
1 cp866 similar to iso-8859-5
1 cp869 similar to iso-8859-7
1 cp874 Thai
1 cp1250 Czech, Polish, etc.
1 cp1251 Cyrillic
1 cp1253 Greek
1 cp1254 Turkish
1 cp1255 Hebrew
1 cp1256 Arabic
1 cp1257 Baltic
1 cp1258 Vietnamese
1 cp{number} MS-Windows: any installed single-byte codepage
2 cp932 Japanese (Windows only)
2 euc-jp Japanese (Unix only)
2 sjis Japanese (Unix only)
2 cp949 Korean (Unix and Windows)
2 euc-kr Korean (Unix only)
2 cp936 simplified Chinese (Windows only)
2 euc-cn simplified Chinese (Unix only)
2 cp950 traditional Chinese (on Unix alias for big5)
2 big5 traditional Chinese (on Windows alias for cp950)
2 euc-tw traditional Chinese (Unix only)
2 2byte-{name} Unix: any double-byte encoding (Vim specific name)
2 cp{number} MS-Windows: any installed double-byte codepage
u utf-8 32 bit UTF-8 encoded Unicode (ISO/IEC 10646-1)
u ucs-2 16 bit UCS-2 encoded Unicode (ISO/IEC 10646-1)
u ucs-2le like ucs-2, little endian
u utf-16 ucs-2 extended with double-words for more characters
u utf-16le like utf-16, little endian
u ucs-4 32 bit UCS-4 encoded Unicode (ISO/IEC 10646-1)
u ucs-4le like ucs-4, little endian
The {name} can be any encoding name that your system supports. It is passed
to iconv() to convert between the encoding of the file and the current locale.
For MS-Windows "cp{number}" means using codepage {number}.
Examples:
:set encoding=8bit-cp1252
:set encoding=2byte-cp932
The MS-Windows codepage 1252 is very similar to latin1. For practical reasons
the same encoding is used and it's called latin1. 'isprint' can be used to
display the characters 0x80 - 0xA0 or not.
Several aliases can be used, they are translated to one of the names above.
An incomplete list:
1 ansi same as latin1 (obsolete, for backward compatibility)
2 japan Japanese: on Unix "euc-jp", on MS-Windows cp932
2 korea Korean: on Unix "euc-kr", on MS-Windows cp949
2 prc simplified Chinese: on Unix "euc-cn", on MS-Windows cp936
2 chinese same as "prc"
2 taiwan traditional Chinese: on Unix "euc-tw", on MS-Windows cp950
u utf8 same as utf-8
u unicode same as ucs-2
u ucs2be same as ucs-2 (big endian)
u ucs-2be same as ucs-2 (big endian)
u ucs-4be same as ucs-4 (big endian)
u utf-32 same as ucs-4
u utf-32le same as ucs-4le
default stands for the default value of 'encoding', depends on the
environment
For the UCS codes the byte order matters. This is tricky, use UTF-8 whenever
you can. The default is to use big-endian (most significant byte comes
first):
name        bytes          char
ucs-2       11 22          1122
ucs-2le     22 11          1122
ucs-4       11 22 33 44    11223344
ucs-4le     44 33 22 11    11223344
On MS-Windows systems you often want to use "ucs-2le", because it uses little
endian UCS-2.
There are a few encodings which are similar, but not exactly the same. Vim
treats them as if they were different encodings, so that conversion will be
done when needed. You might want to use the similar name to avoid conversion
or when conversion is not possible:
cp932, shift-jis, sjis
cp936, euc-cn
