How to replace non printable characters in file like <97> on linux [duplicate] - linux

I am trying to remove non-printable character (for e.g. ^#) from records in my file. Since the volume to records is too big in the file using cat is not an option as the loop is taking too much time.
I tried using
sed -i 's/[^#a-zA-Z 0-9`~!##$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME
but still the ^# characters are not removed.
Also I tried using
awk '{ sub("[^a-zA-Z0-9\"!##$%^&*|_\[](){}", ""); print } FILENAME > NEW FILE
but it also did not help.
Can anybody suggest some alternative way to remove non-printable characters?
Used tr -cd but it is removing accented characters. But they are required in the file.

Perhaps you could go with the complement of [:print:], which contains all printable characters:
tr -cd '[:print:]' < file > newfile
If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):
sed 's/[^[:print:]]//g' file

Remove all control characters first:
tr -dc '\007-\011\012-\015\040-\376' < file > newfile
Then try your string:
sed -i 's/[^#a-zA-Z 0-9`~!##$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' newfile
I believe that what you see ^# is in fact a zero value \0.
The tr filter from above will remove those as well.

strings -1 file... > outputfile
seems to work. The strings program will take all printable characters, in this case of length 1 (the -1 argument) and print them. It effectively is removing all the non-printable characters.
"man strings" will provide the documentation.

Was searching for this for a while & found a rather simple solution:
The package ansifilter does exactly this. All you need to do is just pipe the output through it.
On Mac:
brew install ansifilter
Then:
cat file.txt | ansifilter

Related

How to insert UTF-16 character with sed in a file?

I have a file coded as UTF-16 and I want to split each line into fields (at fixed positions) separated by commas.
I have tried the following:
Option 1)
sed -i 's/./&,/400;s/./&,/360;*<and so on for several positions in the line>* FILE
This seems to work, but when editing the file with vim, it is obvious that something is wrong, since the commas are displayed as a single character, but the other symbols are displayed as a two-byte character.
BEFORE: 2^#A^#U^#W^#2^#0^#1^#9^#0^#1^#0^#1^#0^#0^#0^#1^#0^#0^#
AFTER: 2^#,A^#U^#,W^#,2^#0^#1^#9^#0^#1^#0^#1^#,0^#0^#0^#1^#0^#0^#,
Option 2)
Then I tried to use sed again but, instead of "," I typed the UTF-16 code of the comma, that is 002c:
sed -i "s/./&\U002c/400...
or even
s/./&$(echo -ne '\u002c')/400...
None of these options worked, the results are exactly the same as in Option1.
The problem is not the insertion of a new character with sed.
I noticed that the file is UTF-16LE (I was assuming it was UTF-16BE). I converted it directly to ascii with
iconv -f UTF-16LE -t ASCII sourcefile >destinationfile
and after that I could handle the file as usual with sed, grep, awk, etc.

sed doesn't remove characters from UTF range properly

I want to clear my file from all characters except russian and arabic letters, "|" and space mark. Lets start with only arabic letters. So I have:
cat file.tzt | sed 's/[^\u0600-\u06FF]//g'
sed: -e expression #1, char 21: Invalid range end.
I have tried [\u0621-\u064A] - same.
I also tried to use {Arabic}, but it doesn't clean files properly at all.
Error looks kinda strange for me. Obviously, 064FF > 0621.
So, overall I want to have something like this:
cat file.tzt | sed 's/[^\u0600-\u06FFа-яА-Я |]//g'
And I am ok with awk or any other utility, but as I know sed is stable and reliable.
Perl understands UTF-8:
perl -CSD -pe 's/[^\N{U+0600}-\N{U+06FF}]//g' -- file.txt
-C turns of UTF-8 support, S means for stdin/stdout/stderr, D means for any i/o streams.
You can also use Unicode properties:
s/\P{Cyrillic}//g

Remove lines with japanese characters from a file

First question on here- I've searched around to put together an answer to this but have come up empty thus far.
I have a multi-line text file that I am cleaning up. Part of this is to remove lines that include Japanese characters. I have been using sed for my other operations but it is not working in this instance.
I was under the impression that using the -r switch and the \p{Han} regular expression would work (from looking at other questions of this kind), but it is not working in this case.
Here is my test string - running this returns the full string, and does not filter out the JP characters as I was expecting.
echo 80岁返老还童的处女: 第3话 | sed -r "s/\\p\{Han\}//g"
Am I missing something? Is there another command I should be using instead?
I think this might work for you:
echo "80岁返老还童的处女: 第3话" | tr -cd '[:print:]\n'
sed doesn't support unicode classes AFAIK, and nor support multibyte ranges.
-d deletes characters in SET1, and -c reverses it.
[:print:] matches all printable characters including space.
\n is a newline
The above will not only remove Japanese characters but all multibyte characters, including control characters.
Perl can also be used:
PERLIO=:utf8 perl -pe 's/\p{Han}//g' file
PERLIO=:utf8 tells Perl to tread input and output as UTF-8

How to replace a <85> to a new line in bash script

I’m running out of idea on how to replace this character “<85>” to a new line (please treat this as one character only – I think this is a non-printable character).
I tried this one in my script:
cat file | awk '{gsub(”<85>”,RS);print}' > /tmp/file.txt
but didn’t work.
I hope someone can help.
Thanks.
With sed: sed -e $'s/\302\205/\\n/' file > file.txt
Or awk: awk '{gsub("\302\205","\n")}7'
The magic here was in converting the <85> character to octal codepoints.
I used hexdump -b on a file I manually inserted that character into.
tr '\205' '\n' <file > file.txt
tr is the transliterate command; it translates one character to another (or deletes it, or …). The version of tr on Mac OS X doesn't recognize hexadecimal escapes, so you have to use octal, and octal 205 is hex 85.
I am assuming that the file contains a single byte '\x85', rather than some combination of bytes that is being presented as <85>. tr is not good for recognizing multibyte sequences that need to be transliterated.

Replace whitespace with a comma in a text file in Linux

I need to edit a few text files (an output from sar) and convert them into CSV files.
I need to change every whitespace (maybe it's a tab between the numbers in the output) using sed or awk functions (an easy shell script in Linux).
Can anyone help me? Every command I used didn't change the file at all; I tried gsub.
tr ' ' ',' <input >output
Substitutes each space with a comma, if you need you can make a pass with the -s flag (squeeze repeats), that replaces each input sequence of a repeated character that is listed in SET1 (the blank space) with a single occurrence of that character.
Use of squeeze repeats used to after substitute tabs:
tr -s '\t' <input | tr '\t' ',' >output
Try something like:
sed 's/[:space:]+/,/g' orig.txt > modified.txt
The character class [:space:] will match all whitespace (spaces, tabs, etc.). If you just want to replace a single character, eg. just space, use that only.
EDIT: Actually [:space:] includes carriage return, so this may not do what you want. The following will replace tabs and spaces.
sed 's/[:blank:]+/,/g' orig.txt > modified.txt
as will
sed 's/[\t ]+/,/g' orig.txt > modified.txt
In all of this, you need to be careful that the items in your file that are separated by whitespace don't contain their own whitespace that you want to keep, eg. two words.
without looking at your input file, only a guess
awk '{$1=$1}1' OFS=","
redirect to another file and rename as needed
What about something like this :
cat texte.txt | sed -e 's/\s/,/g' > texte-new.txt
(Yes, with some useless catting and piping ; could also use < to read from the file directly, I suppose -- used cat first to output the content of the file, and only after, I added sed to my command-line)
EDIT : as #ghostdog74 pointed out in a comment, there's definitly no need for thet cat/pipe ; you can give the name of the file to sed :
sed -e 's/\s/,/g' texte.txt > texte-new.txt
If "texte.txt" is this way :
$ cat texte.txt
this is a text
in which I want to replace
spaces by commas
You'll get a "texte-new.txt" that'll look like this :
$ cat texte-new.txt
this,is,a,text
in,which,I,want,to,replace
spaces,by,commas
I wouldn't go just replacing the old file by the new one (could be done with sed -i, if I remember correctly ; and as #ghostdog74 said, this one would accept creating the backup on the fly) : keeping might be wise, as a security measure (even if it means having to rename it to something like "texte-backup.txt")
This command should work:
sed "s/\s/,/g" < infile.txt > outfile.txt
Note that you have to redirect the output to a new file. The input file is not changed in place.
sed can do this:
sed 's/[\t ]/,/g' input.file
That will send to the console,
sed -i 's/[\t ]/,/g' input.file
will edit the file in-place
Here's a Perl script which will edit the files in-place:
perl -i.bak -lpe 's/\s+/,/g' files*
Consecutive whitespace is converted to a single comma.
Each input file is moved to .bak
These command-line options are used:
-i.bak edit in-place and make .bak copies
-p loop around every line of the input file, automatically print the line
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
If you want to replace an arbitrary sequence of blank characters (tab, space) with one comma, use the following:
sed 's/[\t ]+/,/g' input_file > output_file
or
sed -r 's/[[:blank:]]+/,/g' input_file > output_file
If some of your input lines include leading space characters which are redundant and don't need to be converted to commas, then first you need to get rid of them, and then convert the remaining blank characters to commas. For such case, use the following:
sed 's/ +//' input_file | sed 's/[\t ]+/,/g' > output_file
This worked for me.
sed -e 's/\s\+/,/g' input.txt >> output.csv

Resources