Vi character encoding issue - vim

I am trying to view the contents of a Finnish text file in vi. However, it shows some letters as hex codes; for example, <8a> stands for ä, etc.
I changed the character encoding to ISO8859-1, but the umlaut letters still do not show when viewing the file. I also tried :set encoding=latin1 in vi, but that only changes them to different letters, not umlauts. Finally, I tried to replace those codes with the original letters, but I am getting a 'pattern not found' error. I am not sure whether I am doing the substitution correctly, though: %s/<8a>/ä
Are there any more solution ideas?

Try from command line:
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
where filename is the original file you need to open with the right encoding.
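If the result still looks wrong, the file may not be Latin-1 at all: the byte 0x8a is ä in the classic Mac Roman encoding, which would explain the <8a> display. A minimal sketch, assuming the file really is Mac Roman and that your iconv knows the MACINTOSH alias, would be to convert it before opening:
# Convert from Mac Roman to UTF-8, then open the converted copy
iconv -f MACINTOSH -t UTF-8 filename > filename.utf8
vim filename.utf8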

Related

how to visualise and delete trailing newline at the end of file in vim/nvim

Sometimes I need to edit files which should not end with a newline.
However, vim and nvim by default do not visualise the newline character at the end of the file in any way. Therefore I am not able to:
visually confirm if the file has a newline character at the end or not
remove that character
Are there any settings which would allow me to see the trailing newline character and edit it in the same way as any other character?
For example, after creating 2 files as follows:
echo test > file-with-newline
echo -n test > file-without-newline
opening first one with nvim file-with-newline shows:
test
~
~
file-with-newline
opening second one with nvim file-without-newline shows:
test
~
~
file-without-newline
Navigating with the cursor to the end of the line in either case yields the same result (the cursor stops after the last visible character: t). There is no way to tell whether the newline is there or not, let alone remove it using the familiar commands used to remove ordinary characters (or newlines within the file).
You can enable the option :help 'list':
:set list
to show that "newline character" as a $ at the end of the line (among other things).
Note, however, that the option doesn't make the character "editable" in any way.
To check if the file has a newline character at the end or not:
:set eol?
endofline
To remove that character:
:set noeol nofixeol
:update
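Putting it together, a small sketch of the whole round trip on the example files (the same options as above, scripted non-interactively with vim -c; nofixeol needs a reasonably recent vim/nvim):
# "test\n" -> 5 bytes
wc -c file-with-newline
# strip the trailing newline without an interactive session
vim -c "set noeol nofixeol" -c "wq" file-with-newline
# now 4 bytes: "test" with no trailing newline
wc -c file-with-newline
# shell-side double check: od shows \n as the last byte only if
# the newline is still there
tail -c 1 file-with-newline | od -c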

Preserving accented letters when running a Perl script from the Linux terminal

I want to get a plain text file from the French Wikipedia dump XML file.
To that end, I am applying a Perl script.
I can give the full file if necessary; I only added the line
tr/a-zàâééèëêîôûùç-/ /cs;
to the script here: http://mattmahoney.net/dc/textdata.html
However, when I run the following in the Linux terminal:
perl filterwikifr.pl frwiki.xml > frwikiplaintext.txt
the output text file does not print accented letters correctly. For example, I get catÃ©gorie instead of catégorie...
I also tried:
perl -CS filterwikifr.pl frwiki.xml > frwikiplaintext.txt
with no better success (and with other variants instead of -CS...)
The problem is with the text editor gedit.
If, instead of opening the file directly, I open gedit first, go to "Open", and under "Character encoding" choose UTF-8 instead of "Automatically Detected", then the accents are printed correctly.
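As a sanity check that the Perl output itself is valid UTF-8 (so the only problem is gedit's automatic detection), something along these lines should work with the standard file utility:
# should report something like: text/plain; charset=utf-8
file -bi frwikiplaintext.txt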

Understanding sed

I am trying to understand how
sed 's/\^\[/\o33/g;s/\[1G\[/\[27G\[/' /var/log/boot
worked and what the pieces mean. The man page I read just confused me more, and I tried the info pages but had no idea how to work them! I'm pretty new to Linux. Debian is my first distro, but it seemed like a rather logical place to start, as it is the root of many others and has been around a while, so it is probably doing things well and is fairly standardized. I am running Wheezy 64-bit, FYI, if needed.
The sed command is a stream editor, reading its file (or STDIN) for input, applying commands to the input, and presenting the results (if any) to the output (STDOUT).
The general syntax for sed is
sed [OPTIONS] COMMAND FILE
In the shell command you gave:
sed 's/\^\[/\o33/g;s/\[1G\[/\[27G\[/' /var/log/boot
the sed command is 's/\^\[/\o33/g;s/\[1G\[/\[27G\[/' and /var/log/boot is the file.
The given sed command is actually two separate commands:
s/\^\[/\o33/g
s/\[1G\[/\[27G\[/
The intent of #1, the s (substitute) command, is to replace all occurrences of '^[' with an octal value of 033 (the ESC character). However, there is a mistake in this sed command. The proper bash syntax for an escaped octal code is \nnn, so the proper way for this sed command to have been written is:
s/\^\[/\033/g
Notice the trailing g after the replacement string? It means to perform a global replacement; without it, only the first occurrence on each line would be changed.
The purpose of #2 is to replace all occurrences of the string \[1G\[ with \[27G\[. However, this command also has a mistake: a trailing g is needed to cause a global replacement. So, this second command needs to be written like this:
s/\[1G\[/\[27G\[/g
Finally, putting all this together, the two sed commands are applied across the contents of the /var/log/boot file, where the output has had all occurrences of ^[ converted into \033, and the strings \[1G\[ have been converted to \[27G\[.
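One way to sidestep the octal-escape question entirely is to let the shell translate the escape before sed ever sees it. This is only a sketch, assuming a bash-like shell (for $'...' ANSI-C quoting) and GNU sed:
# $'...' turns \033 into a real ESC byte; \^ and \[ are not recognised
# shell escapes, so bash passes them through to sed unchanged
sed $'s/\^\[/\033/g; s/\[1G\[/\[27G\[/g' /var/log/boot | less -R
# less -R then renders the restored colour sequences instead of
# printing them as literal text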

how to tell when vim has added EOL character

How can I see in vim whether a file has a newline character at the end? It seems vim always shows one, whether it's truly there or not.
For example, opening a file that has a newline at the end:
echo "hi" > hi
# confirm that there are 3 characters, because of the extra newline added by echo
wc -c hi
# open in binary mode to prevent vim from adding its own newline
vim -b hi
:set list
This shows:
hi$
Now by comparison, a file without the newline:
# prevent echo from adding newline
echo -n "hi" > hi
# confirm that there are only 2 characters now
wc -c hi
# open in binary mode to prevent vim from adding its own newline
vim -b hi
:set list
Still shows:
hi$
So how can I see whether a file truly has a newline at the end or not in vim?
Vim stores this information in the 'endofline' buffer option. You can check with
:setlocal eol?
In the second case, Vim displays [noeol] when loading the file.
Don't you see the [noeol] output when loading that file?
Vim warns you when it opens/writes a file with no <EOL> at the end of the last line:
"filename" [noeol] 4L, 38C
"filename" [noeol] 6L, 67C written

encoding problem?

I work with txt files, and I recently found e.g. these characters in a few of them:
http://pastebin.com/raw.php?i=Bdj6J3f4
What could these characters be? Wrong character encoding? I just want to use normal UTF-8 TXT files, but when I use:
iconv -t UTF-8 input.txt > output.txt
it's still the same.
When I open the files in gedit and copy+paste them into other txt files, there are no characters like the ones in the pastebin. So gedit can solve this problem; it encodes the TXT files well. But there are too many txt files.
Why are there chars like the ones at http://pastebin.com/raw.php?i=Bdj6J3f4 in the text files? Can they be converted to "normal chars"? I can't see, e.g., the "Ì" char when I open the files with vim, only after I "work with them" (e.g. with awk, etc.).
It would help if you posted the actual binary content of your file (perhaps by using the output of od -t x1). The pastebin returns this as HTML:
"ÃŒ"
"Ã"
"é"
The first line corresponds to U+00C3 U+0152. The last line corresponds to U+00C3 U+00A9, which is the character "é" (U+00E9) encoded in UTF-8 ("\xc3\xa9") and then reinterpreted as Latin-1.
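To see the difference at the byte level, a quick check along the lines of that od -t x1 suggestion (assuming a UTF-8 terminal and shell; the strings are just illustrations):
# double-encoded: the UTF-8 bytes of "é" re-encoded once more
printf 'é' | od -t x1   # c3 83 c2 a9
# correctly encoded "é"
printf 'é' | od -t x1    # c3 a9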
From man iconv:
The iconv program converts text from one encoding to another encoding. More precisely, it converts from the encoding given for the -f option to the encoding given for the -t option. Either of these encodings defaults to the encoding of the current locale.
Because you didn't specify the -f option, it assumes the file is encoded with your current locale's encoding (probably UTF-8), which apparently is not true. Your text editors (gedit, vim) do some encoding detection - you can check which encoding they detect (I don't know how - I don't use either of them) and use that as the -f option to iconv (or save the open file with your desired encoding using one of those text editors).
You can also use some tool for encoding detection like Python chardet module:
$ python -c "import chardet as c; print c.detect(open('file.txt').read(4096))"
{'confidence': 0.7331842298102511, 'encoding': 'ISO-8859-2'}
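With a detected encoding like the one above, the conversion then becomes explicit (a sketch; ISO-8859-2 is simply the value chardet guessed in this example):
# -f names the real encoding of the input, so iconv actually converts
# instead of assuming the locale's encoding
iconv -f ISO-8859-2 -t UTF-8 file.txt > file.utf8.txt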
Solved!
How:
I just right-clicked on the folders containing the TXT files and pasted them into another folder... :O and presto, there are no more ugly chars.
