Text document going From DOS to Linux in Vim - linux

I was given a trace file in XML format (created on a Windows machine). When I open it in Vim or cat it on the command line (on Mac or Linux), it visually appears fine. But after an XML parser failed to load the document as I'd expect, I dug a little deeper and found that there are non-printable chars throughout:
h001:logs bill$ xxd trace.xml | head -n 3
0000000: fffe 3c00 3f00 7800 6d00 6c00 2000 7600 ..<.?.x.m.l. .v.
0000010: 6500 7200 7300 6900 6f00 6e00 3d00 2200 e.r.s.i.o.n.=.".
0000020: 3100 2e00 3000 2200 2000 6500 6e00 6300 1...0.". .e.n.c.
I then tried the following with no luck removing these non-printed chars:
:%s/[^[:print:]]//g
:%s/[^[:control:]]//g
:%s/[^[:null:]]//g
I'm figuring this is due to the fact I'm switching from Windows to Linux, but I'm not seeing any of the usual artifacts (e.g. ^M, ^#, etc).
Any thoughts on what's happening here and what would be the right way to remove these from within Vim?

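The file is encoded as UTF-16: the leading ff fe is a little-endian byte order mark, and every ASCII character is followed by a 00 byte. The problem is your XML parser doesn't understand UTF-16.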
You can convert it by opening an empty vim session and doing:
:e ++enc=utf-16le file.txt
:w ++enc=utf8
This will open the file with UTF-16 little-endian encoding, and then save it as UTF-8.
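The same conversion can be sketched outside Vim. This is a minimal Python sketch, not the answerer's method; the function name utf16_to_utf8 is hypothetical, and treating a file without a byte order mark as little-endian is an assumption.

```python
import codecs


def utf16_to_utf8(raw: bytes) -> bytes:
    """Decode UTF-16 bytes and re-encode them as UTF-8."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic "utf-16" codec reads the BOM, picks the byte order
        # and drops the BOM from the decoded text.
        text = raw.decode("utf-16")
    else:
        # Assumption: a file without a BOM is little-endian.
        text = raw.decode("utf-16-le")
    return text.encode("utf-8")


# First bytes of the xxd dump in the question: BOM ff fe, then "<?xml"
# with a 00 byte after each character.
sample = bytes.fromhex("fffe3c003f0078006d006c00")
print(utf16_to_utf8(sample))  # b'<?xml'
```

If the XML declaration in the file names UTF-16 as its encoding, it may also need to be edited after the conversion so that it matches the new bytes.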

Related

display accented characters on mintty (cygwin) console?

in a nutshell, i would like to be able to type and display characters from iso-8859-1 on my cygwin mintty. unfortunately i haven't figured out how to do this.
my locale :
$ locale
LANG=C.ISO-8859-1
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C
mintty is configured as an xterm (although it seems to make no difference what terminal emulation i choose), and through options => text, i have configured the 'locale' section as C and the character set as ISO-8859-1.
when i type any accented character from my keyboard, the character does not display on the terminal. however, if i invoke cat, the characters i type display correctly. also, when i edit using vi (well, vim, actually), i am able to type (and display) accented characters without problems. so the problem seems to have something to do with the shell and not with the terminal emulation itself.
furthermore, if i write a little script to make a file named, for example, être.utx, the file displays as ???tre.utx when i ls it. looking at its hex, i get
$ ls *.utx | od -c -tx1
0000000 357 203 252 t r e . u t x \n
ef 83 aa 74 72 65 2e 75 74 78 0a
0000013
so it seems the script i wrote is creating a file whose name begins with the three-byte sequence 0xEF 0x83 0xAA, rather than the single-byte character whose encoding should be 0xEA. i don't know how to interpret this ; i know it isn't the utf-8 encoding of ê, which would be 0xC3 0xAA.
it appears there is only one character set in my cygwin configuration that is configured to support 8859-1 : norwegian. [of course, i suppose i could learn norwegian, but i would prefer something a bit less strenuous, if possible...]
in any case, does anyone have an idea what i am doing wrong ?
many thanks in advance.
Just set mintty's locale to something utf8-ish.
In my case:
Window Menu (Alt+Space)
Options… (o)
Text (l.h. panel)
Locale → en_GB
Character Set → UTF-8
[Save]
Quit and restart
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=
$ echo $'\u2154'
⅔
Nice
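The byte values quoted in the question can be checked with a few lines of Python. This is only a sketch; the helper name code_point_of is hypothetical. It shows that ê is 0xEA in ISO-8859-1 and 0xC3 0xAA in UTF-8, and that 0xEF 0x83 0xAA is itself well-formed UTF-8 for the private-use code point U+F0EA. That the low byte of that code point equals 0xEA suggests the original byte was remapped somewhere along the way, but the thread does not confirm this.

```python
def code_point_of(raw: bytes) -> str:
    """Decode a UTF-8 sequence holding one character and name its code point."""
    return f"U+{ord(raw.decode('utf-8')):04X}"


e_circumflex = "\u00ea"  # the character ê

print(e_circumflex.encode("iso-8859-1").hex())  # ea
print(e_circumflex.encode("utf-8").hex())       # c3aa
print(code_point_of(bytes.fromhex("ef83aa")))   # U+F0EA
```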

Git messes up with non-ascii characters on Linux container

I have a .Net Core (C#) project with the following line in one of the classes:
var input = "£";
But when I do a git clone in a Docker container (microsoft/dotnet:2.2-sdk) it messes it up and displays it as � (in bash using cat).
And when I run it, its Utf-8 bytes are [239, 191, 189] = [EF, BF, BD] which seem to be a so-called Unicode replacement character.
The Windows editor that I use is VS 2017, but the character is displayed properly on other Windows machines and parsed properly by the dotnet run/test commands, so I don't think the problem is that the character was saved incorrectly.
Any ideas why I am seeing such a mess and how to solve it?
Some details
I get bytes using Encoding.UTF8.GetBytes("£");
It works perfectly well on Windows 10 machine
Linux version Debian GNU/Linux 9 (stretch) from the cat /etc/os-release
locale -a returns C C.UTF-8 POSIX
On Windows, Notepad++ claims the file is ANSI when it is opened, and the character is displayed correctly.
Running fgrep 'var input' file.cs | od -tx1 -c
0000100 76 61 72 20 69 6e 70 75 74 20 3d 20 22 a3 22 3b
v a r i n p u t = " 243 " ;
Your file contains a single byte a3, which is the Windows-1252 encoding of the character £. Your Linux system displays � because a lone a3 byte is not valid UTF-8 (in UTF-8, £ is the two bytes c2 a3).
You should configure Visual Studio to use UTF-8 instead of Windows-1252.
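The byte-level behaviour described above can be reproduced in a short Python sketch. The function name cp1252_to_utf8 is hypothetical, and re-encoding an existing file this way is an illustration rather than something the thread proposes.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Reinterpret Windows-1252 bytes as text and encode that text as UTF-8."""
    return data.decode("cp1252").encode("utf-8")


lone = b"\xa3"  # the single byte found in the source file

# Decoding as UTF-8 fails on this byte; with errors="replace" the decoder
# substitutes U+FFFD, the Unicode replacement character.
shown = lone.decode("utf-8", errors="replace")
print(shown == "\ufffd")            # True
print(list(shown.encode("utf-8")))  # [239, 191, 189]

# The same byte read as Windows-1252 is the pound sign, which UTF-8
# encodes as two bytes.
print(cp1252_to_utf8(lone))         # b'\xc2\xa3'
```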

The output of a .txt file created in vi editor show all the text in one line

My scripts write their report to a log file, report.txt. When I view it in the vi editor, it looks the way I want it to:
Sanity Report
Start time:Fri Mar 10 08:08:33 CST 2017
LABS:
1: lht1-u0 (172.28.152.240)
2: lht1-u1 (172.28.152.241)
BUILDS:
CCM: 455
AMM: 395
OEBase: 864
ACS_DM: 569
AMS_DM: 707
TC Area TC Title Status
System-VM0 install ------------------------------------- Passed
System-VM1 install ------------------------------------- Passed
OpensSaf start ----------------------------- Passed
Verify alarmd server is -------------------------------- Passed
Product install of AMM ------------------------- Passed
Product install of AMM ------------------------- Passed
But when I open the actual text file in Windows (this file should be emailed to a group of people) it shows all the text in one line.
How can I change this?
This is probably related to the end of line characters.
To my knowledge, the convention differs between Unix and Windows.
DOS / Windows uses :
\r\n
Unix only:
\n
You need a way to convert your unix style end of line to the windows style.
The use of a regex engine can do the trick.
I found my own answer at https://www.maketecheasier.com/convert-files-from-linux-format-windows/ and running awk 'sub("$", "\r")' unix.txt > windows.txt worked well!
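For comparison, the same line-ending conversion can be sketched in Python. The function name unix_to_dos and the file names in the closing comment are hypothetical. Unlike the awk one-liner, this sketch first normalises any existing \r\n so that a file which is already in DOS format is not given doubled carriage returns.

```python
def unix_to_dos(data: bytes) -> bytes:
    """Convert LF line endings to CRLF without doubling existing CRLFs."""
    normalised = data.replace(b"\r\n", b"\n")  # collapse any CRLF to LF first
    return normalised.replace(b"\n", b"\r\n")  # then expand every LF to CRLF


report = b"Sanity Report\nLABS:\nBUILDS:\n"
print(unix_to_dos(report))  # b'Sanity Report\r\nLABS:\r\nBUILDS:\r\n'

# To convert a file, read and write in binary mode so Python does not
# translate line endings itself (hypothetical file names):
#   with open("report.txt", "rb") as src, open("report_dos.txt", "wb") as dst:
#       dst.write(unix_to_dos(src.read()))
```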
For anyone who has a file like this who does not have access to Linux, open the file with Wordpad (make a change??) and save it.
Next time you open it with Notepad it will appear normal.

write all file types to a file - ubuntu

I want to write the file type of all files in a folder to a file.
So this is what i did
for xfile in *.*
do
file "x$file" > values
done
But unfortunately it didn't run, and the error message was
: command not found
'/check.sh: line 3: syntax error near unexpected token `do
'/check.sh: line 3: `do
Could someone please help me get over this ?
Firstly, your variable is xfile but you attempt to use the variable file preceded by the literal x. I think that x$file should be $xfile.
Secondly, you're executing the file command for each file and overwriting the values file each time, meaning only the last one will appear. You can fix this by deleting the file up front and using >> (append) instead of > (overwrite).
Thirdly, the file command is perfectly capable of handling multiple arguments so you really only need:
file * >values
That also uses * because *.* is a Windows/DOS thingie - if you want all files under UNIX-like operating systems, * is the correct form (or .* as well if you want hidden files).
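The difference between the three patterns can be illustrated with Python's glob module, which treats leading dots the way a POSIX shell does by default. This is a sketch only; the helper name match and the sample file names are hypothetical.

```python
import glob
import os
import tempfile


def match(pattern: str, names: list) -> list:
    """Create empty files with the given names and return those matching pattern."""
    with tempfile.TemporaryDirectory() as directory:
        for name in names:
            open(os.path.join(directory, name), "w").close()
        # Escape the directory part so only `pattern` is treated as a glob.
        full_pattern = os.path.join(glob.escape(directory), pattern)
        return sorted(os.path.basename(path) for path in glob.glob(full_pattern))


names = ["README", "notes.txt", ".hidden"]
print(match("*", names))    # ['README', 'notes.txt']
print(match("*.*", names))  # ['notes.txt']
print(match(".*", names))   # ['.hidden']
```

Here *.* skips README because the name contains no dot, which is why * is the better choice on UNIX-like systems.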
And finally, none of those suggestions above will fix your actual problem, which is almost certainly the presence of DOS-style line-endings in your file. That error message is a near perfect match to what you would see in that case.
On my system, when I input your script with DOS line endings, I get the same error message (mostly).
If you do an od -xcb check.sh, you will most likely see something like:
0000020 0a0d 6f64 0a0d 6966 656c 2220 7824 6966
\r \n d o \r \n f i l e " $ x f i
015 012 144 157 015 012 146 151 154 145 040 042 044 170 146 151
with the \r characters indicating the incorrect line endings.
You can use your favorite (decent) editor to view and remove these or simply install dos2unix with apt-get (though you may already have this) and do:
# sudo apt-get install dos2unix ## only if you don't have it already.
dos2unix -n check.sh check2.sh
and then use check2.sh.
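If neither a suitable editor nor dos2unix is at hand, the check and the fix can be sketched in Python. The function names count_crlf and dos_to_unix are hypothetical, and the sample bytes are a corrected version of the script saved with DOS line endings, not the asker's actual file.

```python
def count_crlf(data: bytes) -> int:
    """Return how many DOS-style \\r\\n line endings the data contains."""
    return data.count(b"\r\n")


def dos_to_unix(data: bytes) -> bytes:
    """Replace every \\r\\n with a plain \\n."""
    return data.replace(b"\r\n", b"\n")


script = b'for xfile in *\r\ndo\r\n    file "$xfile"\r\ndone\r\n'

print(count_crlf(script))               # 4
print(count_crlf(dos_to_unix(script)))  # 0
```

A non-zero count from count_crlf is the same signal as the \r characters in the od output above.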
Although, keep in mind, the single-line:
file * >values
is still a better option than the for loop (but you'll still want to ensure you're using UNIX line endings).

using the -W option of vim

the -w and -W options of vim have theoretically the following effect:
-w {scriptout} All the characters that you type are recorded in the file
"scriptout", until you exit Vim.
This is useful if you want to create
a script file to be used with "vim -s"
or ":source!". When the "scriptout"
file already exists, new characters
are appended. See also
|complex-repeat|. {scriptout} cannot
start with a digit. {not in Vi}
-W {scriptout} Like -w, but do not append, overwrite an existing file.
{not in Vi}
But when I do this, the {scriptout} file will always begin with a hexadecimal sequence like 80 fd 60 (sometimes it is 80 fd 62).
I am using gvimportable.exe 7.3 from portableapps.com. With the -u NONE switch, it does the same.
What is this “magic number” for? Under Windows with gvim.exe I cannot replay my scriptout until I have removed those three leading bytes…
It seems that this feature, which could be very useful, is poorly documented.
Thank you for your answers.
(This answer is probably fragmented significantly, it took me a while playing around - I wanted to find a solution too because it intrigued me - not just the bounty of 200 :P. It more or less shows my train of thought and experimentation.)
I can now reproduce it with gvim on Linux, which is /usr/bin/vim.gnome -g; running as vim -g does just the same.
Delving into the code: (futile in this case, but fun to do and to learn how to do)
I've looked through the source code and I can now explain it somewhat (but not usefully!); it gets the outfile FILE (src/globals.h:1004) set (src/main.h:2275); this is then written to in src/getchar.h:1501, in the updatescript method which is used by gotchars (line 1215) which is used by vgetorpeek, which is used by vgetc and vpeekc... (no, I don't know where this is going!) then these are used in a number of places.
Anyway, I suppose the key is somewhere in src/gui.c, but I don't know where at the moment! It's also possible that some key sequence is being "sent" (physically or virtually, I don't know), but seeing as the issue is the same across platforms it would seem more likely to be a Vim issue than otherwise.
Interesting situations leading to a probable explanation:
It's also worth noting that if you quit automatically, with gvim -u NONE -w scriptout -c quit (:quit after loading) or gvim -u NONE -w scriptout --cmd quit (instant :quit, never shows GUI), the file scriptout is left empty.
Additionally, if you open gvim and then close it using the X button, pressing no keys:
0000000: 80fd 6280 fd63 80fd 62 ..b..c..b
If you open gvim, click away, click back and use :q:
0000000: 80fd 6280 fd63 80fd 6280 fd2c 80fd 2e3a ..b..c..b..,...:
0000010: 710d q.
So I think the window events are internally translated into something else. 80 fd 62 is the open sequence and 80 fd 63 80 fd 62 is the close sequence.
I've found another way of triggering 80fd as well, which leads me to think it's some sort of "user has access to the window" event; by default with GNOME in Ubuntu, Ctrl+Alt+S does something with the window (can't remember what it's called; slides it all up into the title bar, app inside loses keyboard control etc.). gvim ... (you know the arguments!), i<Ctrl+Alt+S (contracted) Ctrl+Alt+S (expanded) >Esc Z Q produces this for me:
0000000: 80fd 6269 3c80 fd63 80fd 623e 1b5a 51 ..bi<..c..b>.ZQ
Summary: so there we have what I believe is the solution; gVim catches the window messages in some form and - whether it should or shouldn't - puts them in its scriptout. If you think it shouldn't (or would like to know why they're left in or if they're even meant to be or whether you should care at all), ask on the Vim list, I think.
My best guess is that this is a bug in the GUI code of gVim.
Using gVim 7.3, if I run gvim -u NONE -W scriptout then I see the problem, but if I run vim -u NONE -W scriptout then the unwanted bytes are not present.
I also tested Vim 7.2 from the shell in Linux, the version of Vim included in Snow Leopard (7.2), and the GUI and terminal versions of MacVim 7.2 (with mvim -W and /Applications/MacVim/Contents/MacOS/Vim -W, respectively) and they all worked correctly.
Someone has done the hard work for us in the vimgolf project, in particular this well-commented file: https://github.com/igrigorik/vimgolf/blob/master/lib/vimgolf/lib/vimgolf/keylog.rb
0x80 is the escape byte that introduces special two-byte codes. In this case they represent gvim focus events. See here:
# If you use gvim, you'll get an entry in your keylog every time the
# window gains or loses focus. These "keystrokes" should not show and
# should not be counted.
"\xfd\x60" => nil, # 7.2 Focus Gained compat
"\xfd\x61" => nil, # Focus Gained (GVIM) (>7.4.1433)
"\xfd\x62" => nil, # Focus Gained (GVIM)
"\xfd\x63" => nil, # Focus Lost (GVIM)

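Building on the table above, a scriptout file could be cleaned before replaying it. The following Python sketch is not part of vimgolf or Vim; the function name strip_focus_events is hypothetical, and it assumes that only the four focus codes listed above (fd 60 to fd 63 after the 0x80 escape byte) need to be removed.

```python
import re

# Assumption: only the four focus codes from the vimgolf table are stripped.
# Other 0x80-prefixed special codes are left untouched.
FOCUS_EVENT = re.compile(rb"\x80\xfd[\x60-\x63]")


def strip_focus_events(keylog: bytes) -> bytes:
    """Remove gvim focus gained/lost sequences from recorded keystrokes."""
    return FOCUS_EVENT.sub(b"", keylog)


# Bytes from the Ctrl+Alt+S experiment earlier in this thread.
sample = bytes.fromhex("80fd62693c80fd6380fd623e1b5a51")
print(strip_focus_events(sample))  # b'i<>\x1bZQ'
```

What remains is just the typed keys: i, <, >, Esc, Z, Q.
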
Resources