How to strip binary characters from a file? - linux

I've got a file that contains lines that look like this in vim:
^[[0;32msalt-2016.3.2-1.el6.noarch^[[0;0m^M
which look like this in more:
salt-2016.3.2-1.el6.noarch
I would like to produce a copy of this file that only contains the displayed characters as more shows them. I tried piping it through dos2unix but it refuses to do anything, complaining that "dos2unix: Binary symbol 0x1B found at line 2".
Probably I could achieve what I want with some sed statements, but I'm wondering whether there is a linux/unix utility that will take output from more or cat and produce a file that contains only the whitespace and text as displayed?

There's something called ansifilter which does exactly this. I tested it out on my file and it works.
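The question also mentions sed; for output like the sample line above, a rough equivalent is to strip the ANSI colour escapes and the trailing carriage returns by hand (a sketch, assuming GNU sed so that the \x1b and \r escapes are understood; file names are placeholders):
sed -e 's/\x1b\[[0-9;]*m//g' -e 's/\r$//' coloured.txt > plain.txt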

Related

How to echo/print actual file contents on a unix system

I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.034,290
Instead of:
0.032,170
0.034,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.034,290
How can I print the actual characters within the file?
This reads as if you misunderstand what the "actual characters in the file" are. You will not find the characters \ and n in that file, only a line feed, which is a single specific character. So utilities like cat do in fact output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked that, just to be sure.
You can easily check this yourself if you open the file in a hex editor. There you will see the character 0A (decimal 10), which is a line feed character. You will not see the pair of characters \ and n anywhere in that file.
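From the shell, a hexdump utility makes the same check (assuming od or xxd is installed); od -c in particular prints each byte and shows the line feed using the familiar \n notation:
od -c example.csv
xxd example.csv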
Many programming languages and shell environments use escape sequences like \n in string definitions to denote control characters that could not be typed otherwise. Maybe that is where your impression comes from that your file should contain those two characters.
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.
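For the example file above, that command would print everything on a single line, roughly:
$ awk 1 ORS='\\n' example.csv
0.032,170\n0.034,290\n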

Search for ill-encoded characters in a file on Linux

I have a lot of huge CSV files, and some of them contain ill-encoded characters: in vi, I see things like "<8f>" or "<8e>", for example.
First, I wanted to search and replace (:%s) all the characters, but that would be a very long process, because I would have to do it every time I handle a file, and I'm not always sure which new characters are present.
Is it possible to detect such characters, so that I can extract lines containing ill encoded characters?
A simple command may exist, taking a file for argument and creating a file containing only the lines with a problem.
I don't know if I'm explaining myself very well...
Thanks in advance!
You could use :g/char/p in Vim to print all the lines containing a given character, or use grep from the shell:
grep -lr 'char1\|char2\|char3' .
This will output all the files in the directory containing any of the characters you have listed (-r makes it recursive and -l lists only the filenames, rather than every matching line).
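If the goal is a new file holding only the problematic lines, as asked, one option is to match any byte outside the ASCII range; a sketch assuming GNU grep with -P support, with file names as placeholders:
LC_ALL=C grep -P '[\x80-\xFF]' huge.csv > bad_lines.csv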

sort: string comparison failed: Invalid or incomplete multibyte or wide character

I'm trying to use the following command on a text file:
$ sort <m.txt | uniq -c | sort -nr >m.dict
However I get the following error message:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘enwedig\r’ and ‘mwy\r’.
I'm using Cygwin on Windows 7 and was having trouble earlier editing m.txt to put each word within the file on a new line. Please see:
Using AWK to place each word in a text file on a new line
I'm not sure if I'm getting these errors because of that, or because m.txt contains characters from the Welsh alphabet (when I was working with Welsh text in Python, I had to change the encoding to 'Latin-1').
I tried following the error message's advice and setting LC_ALL='C', but this has not helped. Can anyone elaborate on the errors I'm receiving and provide advice on how I might go about solving this problem?
UPDATE:
When trying dos2unix, errors were displayed about invalid characters at certain lines. It turns out these were not Welsh characters but other strange characters (arrows etc.). I went through my text file removing these characters until I was able to run dos2unix without error. However, after running dos2unix all the text was concatenated (no spaces or newlines at all, whereas each word in the file should have been on a separate line). I then used unix2dos and the text file was back to normal. How can I get each word on its own line and use the sort command without it giving me errors about '\r' characters?
I know it's an old question, but simply running export LC_ALL='C' does the trick, exactly as the error message suggests: sort: Set LC_ALL='C' to work around the problem.
Looks like a Windows line-ending related problem (\r\n versus \n). You can convert m.txt to Unix line-endings with
dos2unix m.txt
and then rerun your command.
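If dos2unix refuses to touch the file because of stray control characters, stripping the carriage returns directly also works (a sketch using tr, reusing the file names from the question):
tr -d '\r' < m.txt | sort | uniq -c | sort -nr > m.dict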

Saving a flat file through Vim adds an invisible byte to the file that creates a new line

The title is not really specific, but I have trouble identifying the correct keywords as I'm not sure what is going on here. For the same reason, it is possible that my question is a duplicate. If that's the case: sorry!
I have a Linux application that receives data via flat files. I don't know exactly how those files are generated, but I can read them without any problem. They are short files, only a line each.
For testing purposes, I tried to modify one of those files and reinject it into the application. But when I do that, I can see in the log that a mysterious page break was added at the end of the message (resulting in the application not recognising the message)...
For the sake of example, let's say I receive a flat file, named original, that contains the following:
ABCDEF
I make a copy of this file and named it copy.
If I compare those two files using the diff command, it says they are identical (as I expect them to be).
If I open copy in Vi and quit without changing or saving anything, and then use the diff command, it says they are identical (as I also expect them to be).
If I open copy in Vi and save it without changing anything, and then use the diff command, I get the following (I added the dot for layout purposes):
diff original copy
1c1
< ABCDEF
\ No newline at end of file
---
.> ABCDEF
And if I compare the sizes of my two files, I can see that original is 71 bytes while copy is 72.
It seems that the format of the file changes when I save it. I first thought of an encoding problem, so I used the ":set list" command in Vim to show the invisible characters. But for both files, I see the following:
ABCDEF$
I have found other ways to run my test, but this problem still bugs me and I would really like to understand it. So, my two questions are:
What is happening here?
How can I modify a file like that without creating this mysterious page break?
Thank you for your help!
What happens is that Vim is set by default to assume that the files you edit end with a "newline" character. That's normal behavior in UNIX-land. But the "files" your program is reading look more like "streams" to me because they don't end with a newline character.
To ensure that those "files" are written without a newline character, set the following options before writing:
:set binary noeol
See :help 'eol'.
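To confirm from the shell which file carries the extra newline byte, one can dump just the last byte of each (assuming xxd is available; original and copy as in the example above):
tail -c 1 original | xxd
tail -c 1 copy | xxd
The copy ends in 0a, the line feed Vim appended, while the original does not.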

Removing lines containing encoding errors in a text file

I must warn you that I'm a beginner. I have a text file in which some lines contain encoding errors. By "error", I mean what I get when viewing the file in my Linux console: question marks instead of characters.
I want to remove every line showing those question marks. I tried to grep -v the problematic character, but it doesn't work. The file itself is UTF-8, and I guess some of the lines come from texts encoded in another format. I know I could find a way to reconvert them properly, but I just want them gone for now.
Do you have any ideas about how I could do this please?
PS: Some lines contain diacritics which are displayed fine. The "strings" command seems to remove too many "good" lines.
When dealing with mojibake in character encodings other than ANSI, you must check two things:
Is the file really encoded in X? (X being UTF-8 WITHOUT BOM in your case. You could be trying to read UTF-8 WITH BOM, UTF-16, latin-1, etc. as UTF-8, and that would be the problem). Try reading in (not converting to) other encodings and see if any of them fits.
Is your locale or text editor set to read the file as UTF-8? If not, that may be the problem. Check for support and figure out how to change the setting. In linux try locale and setlocale commands to check and set it properly.
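For the first of those checks, a command-line sketch (assuming the file and iconv utilities are installed; myfile.csv is a placeholder name) is to ask file for its best guess and then convert once the real source encoding is known:
file -i myfile.csv
iconv -f WINDOWS-1252 -t UTF-8 myfile.csv > myfile.utf8.csv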
I like how Notepad++ for Windows (which also runs perfectly on Linux under Wine) lets you pick any encoding to read the file with, without trying to convert it (of course, if you pick one other than the encoding the file is actually in, you will only see those weird characters). It also has a separate option to convert from one encoding to another. That has been pretty useful to me.
If you are a beginner you may be interested in this article. It explains briefly and clearly the whats, whys and hows of character encoding.
[EDIT] If the above fails even with windows-1252 and similar ANSI encodings, I've just learned how to remove non-ASCII characters using the tr unix command, turning the file into plain ASCII (but be aware that information about the extra characters is lost in this output and there is no coming back, so keep the input file around in case you find a better fix):
tr -cd '\11\12\40-\176' < $INPUT_FILE > $OUTPUT_FILE
or, if you want to get rid of the whole line:
grep -v -P "[^\11\12\40-\176]" $INPUT_FILE > $OUTPUT_FILE
[EDIT 2] This answer gives a pretty good guess of what could be happening if none of the encodings work on your file (unfortunately, the only straightforward solution seems to be removing those problematic characters).
You can use a micro-Perl script like:
perl -pe 's/[^[:ascii:]]+//g;' my_utf8_file.txt
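As written, that prints the cleaned text to standard output and only strips the characters; to drop the whole offending lines instead, while keeping the original file intact, a variant along these lines should work (the output file name is a placeholder):
perl -ne 'print unless /[^[:ascii:]]/' my_utf8_file.txt > clean.txt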
