Removing Control Characters from a File

I want to delete all the control characters from my file using linux bash commands.
There are some control characters like EOF (0x1A) especially which are causing the problem when I load my file in another software. I want to delete this.
Here is what I have tried so far:
this will list all the control characters:
cat -v -e -t file.txt | head -n 10
^A+^X$
^A1^X$
^D ^_$
^E-^D$
^E-^S$
^E1^V$
^F%^_$
^F-^D$
^F.^_$
^F/^_$
^F4EZ$
^G%$
This will list all the control characters using grep:
$ cat file.txt | head -n 10 | grep '[[:cntrl:]]'
+
1
-
-
1
%
-
.
/
matches the above output of cat command.
Now, I ran the following command to show all lines not containing control characters but it is still showing the same output as above (lines with control characters)
$ cat file.txt | head -n 10 | grep '[^[:cntrl:]]'
+
1
-
-
1
%
-
.
/
here is the output in hex format:
$ cat file.txt | head -n 10 | grep '[[:cntrl:]]' | od -t x2
0000000 2b01 0a18 3101 0a18 2004 0a1f 2d05 0a04
0000020 2d05 0a13 3105 0a16 2506 0a1f 2d06 0a04
0000040 2e06 0a1f 2f06 0a1f
0000050
as you can see, the hex values, 0x01, 0x18 are control characters.
I tried using the tr command to delete the control characters but got an error:
$ cat file.txt | tr -d "\r\n" "[:cntrl:]" >> test.txt
tr: extra operand `[:cntrl:]'
Only one string may be given when deleting without squeezing repeats.
Try `tr --help' for more information.
If I delete all control characters, I will end up deleting the newline and carriage return as well which is used as the newline characters on windows. How do I delete all the control characters keeping only the ones required like "\r\n"?
Thanks.

Instead of using the predefined [:cntrl:] set, which as you observed includes \n and \r, just list (in octal) the control characters you want to get rid of:
$ tr -d '\000-\011\013\014\016-\037' < file.txt > newfile.txt
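A quick way to check the command is to run it on a few known bytes and dump the result. This is only a sketch: the sample bytes and the temporary file names are made up for illustration. Note that the range \000-\011 also removes tabs (\011); start the list with \000-\010 instead if tabs should stay, and add \177 if DEL should go as well.

```bash
# Sample input: SOH (\001), SUB/0x1A (\032) and ESC (\033) mixed into CRLF-terminated text.
sample=$(mktemp)
cleaned=$(mktemp)
printf 'a\001b\032c\r\nd\033e\r\n' > "$sample"

# Delete NUL..TAB, VT, FF and SO..US. LF (\012) and CR (\015) are not in the list, so they survive.
LC_ALL=C tr -d '\000-\011\013\014\016-\037' < "$sample" > "$cleaned"

# Inspect the result byte by byte; only letters, \r and \n should remain.
od -c "$cleaned"
```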

Based on this answer on unix.stackexchange, this should do the trick:
$ cat scriptfile.raw | col -b > scriptfile.clean

Try grep, like:
grep -o "[[:print:][:space:]]*" in.txt > out.txt
which prints only printable characters (letters, digits, punctuation) and whitespace characters such as tab, newline, vertical tab, form feed, carriage return, and space.
To be less restrictive, and remove only control characters ([:cntrl:]), delete them by:
tr -d "[:cntrl:]"
If you want to keep \n and \r (which are part of [:cntrl:]), then temporarily replace them with something else, e.g.
cat file.txt | tr '\r\n' '\275\276' | tr -d "[:cntrl:]" | tr "\275\276" "\r\n"
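A side note on the grep attempt in the question: '[^[:cntrl:]]' matches any line that contains at least one non-control character, which is why the same lines came back. To list lines with no control characters at all, invert a match on '[[:cntrl:]]' with -v. A small sketch, using made-up sample data:

```bash
sample=$(mktemp)
# Three lines: clean text, text with an embedded \001, and control characters only.
printf 'clean line\nbad\001line\n\001\030\n' > "$sample"

# Lines containing at least one NON-control character (what '[^[:cntrl:]]' asks for).
has_noncontrol=$(LC_ALL=C grep -c '[^[:cntrl:]]' "$sample")

# Lines containing NO control characters: invert a match on '[[:cntrl:]]'.
no_control=$(LC_ALL=C grep -vc '[[:cntrl:]]' "$sample")

printf 'lines with a non-control character: %s\n' "$has_noncontrol"
printf 'lines with no control characters:   %s\n' "$no_control"
```

On a file with Windows line endings every line contains \r, which is itself a control character, so the -v form would filter out every line there.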

A little late to the party: cat -v <file>
which I think is the easiest to remember of the lot!

Related

How to count number of occurrence consecutive pattern spanning over lines in Bash?

For example, I have a file like this. How can I count the number of occurrences of consecutive N's spanning over lines?
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNGgtgtgtatatatcatgtgtgatgtgtggtgtgtg
gggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNAGaggcatattgatctgttgttttattttcttacag
ttgtggtgtgtggtgNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
The expected result is 4 because there are 4 groups of N
I tried grep -Eozc 'N+', but the result is 1
If possible, I hope the line number and length of N can be shown too

awk '$1=$1' FS='' OFS='\n' file | uniq -c | grep -c N
or
tr -d '\r\n' < file | grep -o 'N*' | grep -c .
Output:
4
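The question also asks for the length of each group. One possible extension of the tr/grep pipeline is sketched below, on made-up sample data; it reports the lengths but not the line numbers, since the newlines are removed before matching.

```bash
sample=$(mktemp)
printf 'NNNN\nNNacgtNN\nNNNgt\nttN\n' > "$sample"

# Join the lines, then emit each run of N on its own line.
runs=$(tr -d '\r\n' < "$sample" | grep -o 'NN*')

# Number of runs.
count=$(printf '%s\n' "$runs" | grep -c .)

# Length of each run, in order of appearance.
lengths=$(printf '%s\n' "$runs" | awk '{ print length($0) }')

printf 'runs: %s\n' "$count"
printf '%s\n' "$lengths"
```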

In plain bash, without using any external command:
v=$(<file)X
v=${v//[[:space:]]}
v=${v//N[^N]/ }
v=${v//[^ ]}
echo ${#v}
Output:
4
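For readers wondering how this works, here is the same sequence with each step commented. The data is held in a variable (an assumption made for this sketch) instead of being read from a file.

```bash
# Sample data: three runs of N once the lines are joined.
data=$'NNN\nNNacgtNN\nNNNNgt\nttN'

v=${data}X            # append a non-N sentinel so a trailing run of N is closed
v=${v//[[:space:]]}   # drop newlines (and other whitespace) to join the lines
v=${v//N[^N]/ }       # the last N of each run plus the next character becomes one space
v=${v//[^ ]}          # delete everything that is not one of those marker spaces
echo "${#v}"          # the number of spaces left equals the number of runs
```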

A little long, but straightforward:
< tmp.txt \
tr -d '\n' | # Strip newlines
tr -s N | # Collapse strings of Ns to a single N
tr -dC N | # Strip anything that *isn't* an N
wc -c # Count the resulting Ns
As a one-liner:
< tmp.txt tr -d '\n' | tr -s N | tr -dC N | wc -c

Invoke a Ruby One-Liner from Bash
You can do this as a Ruby one-liner from Bash, whether reading from a file or standard input. For example:
$ ruby -e 'puts ARGF.read.delete("\n").scan(/N+/).count' example.txt
4
$ ruby -e 'puts ARGF.read.delete("\n").scan(/N+/).count' <<< "$str"
4
The notion is to slurp the whole file, remove all the newlines, and then count the groups of consecutive N characters.
Note: If you want to ignore isolated N's, then just scan for /N{2,}/ instead. That will only count runs of two or more N characters.

Assuming that your data is in a file called test.txt:
We read all data from it.
Show lines that match our pattern (Starts and ends with N and only contains N)
Count number of lines
So here is the code that does this:
cat test.txt | egrep -oe "^N*$" | wc -l

How to concatenate multiple lines of output to one line?

If I run the command cat file | grep pattern, I get many lines of output. How do you concatenate all lines into one line, effectively replacing each "\n" with "\" " (end with " followed by space)?
cat file | grep pattern | xargs sed s/\n/ /g
isn't working for me.

Use tr '\n' ' ' to translate all newline characters to spaces:
$ grep pattern file | tr '\n' ' '
Note: grep reads files, cat concatenates files. Don't cat file | grep!
Edit:
tr can only handle single character translations. You could use awk to change the output record separator like:
$ grep pattern file | awk '{print}' ORS='" '
This would transform:
one
two
three
to:
one" two" three"

Piping output to xargs will concatenate each line of output to a single line with spaces:
grep pattern file | xargs
Or any command, eg. ls | xargs. The default limit of xargs output is ~4096 characters, but can be increased with eg. xargs -s 8192.

In bash, echo with an unquoted command substitution collapses newlines, tabs and runs of spaces into single spaces:
echo $(cat file)

This could be what you want
cat file | grep pattern | paste -sd' '
As to your edit, I'm not sure what it means, perhaps this?
cat file | grep pattern | paste -sd'~' | sed -e 's/~/" "/g'
(this assumes that ~ does not occur in file)

This is an example which produces output separated by commas. You can replace the comma by whatever separator you need.
cat <<EOD | xargs | sed 's/ /,/g'
> 1
> 2
> 3
> 4
> 5
> EOD
produces:
1,2,3,4,5

The fastest and easiest ways I know to solve this problem:
When we want to replace the new line character \n with the space:
xargs < file
xargs has its own limits on the number of characters per line and on the total number of characters, but we can increase them. Details can be found by running xargs --show-limits and, of course, in the manual: man xargs
When we want to replace one character with another exactly one character:
tr '\n' ' ' < file
When we want to replace one character with many characters:
tr '\n' '~' < file | sed s/~/many_characters/g
First, we replace the newline characters \n with tildes ~ (or another unique character not present in the text), and then we replace every tilde (flag g) with the replacement string (many_characters).

Here is another simple method using awk:
# cat > file.txt
a
b
c
# cat file.txt | awk '{ printf("%s ", $0) }'
a b c
Also, if your file has columns, this gives an easy way to concatenate only certain columns:
# cat > cols.txt
a b c
d e f
# cat cols.txt | awk '{ printf("%s ", $2) }'
b e

I like the xargs solution, but if it's important to not collapse spaces, then one might instead do:
sed ':b;N;$!bb;s/\n/ /g'
That will replace newlines with spaces, without substituting the last line terminator like tr '\n' ' ' would.
This also allows you to use other joining strings besides a space, like a comma, etc, something that xargs cannot do:
$ seq 1 5 | sed ':b;N;$!bb;s/\n/,/g'
1,2,3,4,5

Here is the method using ex editor (part of Vim):
Join all lines and print to the standard output:
$ ex +%j +%p -scq! file
Join all lines in-place (in the file):
$ ex +%j -scwq file
Note: This will concatenate all lines inside the file itself!

Probably the best way to do it is using 'awk' tool which will generate output into one line
$ awk ' /pattern/ {print}' ORS=' ' /path/to/file
It will merge all lines into one with space delimiter

paste -sd'~' gave me an error.
Here's what worked for me on mac using bash
cat file | grep pattern | paste -d' ' -s -
From man paste:
-d list Use one or more of the provided characters to replace the newline characters instead of the default tab. The characters
in list are used circularly, i.e., when list is exhausted the first character from list is reused. This continues until
a line from the last input file (in default operation) or the last line in each file (using the -s option) is displayed,
at which time paste begins selecting characters from the beginning of list again.
The following special characters can also be used in list:
\n newline character
\t tab character
\\ backslash character
\0 Empty string (not a null character).
Any other character preceded by a backslash is equivalent to the character itself.
-s Concatenate all of the lines of each separate input file in command line order. The newline character of every line
except the last line in each input file is replaced with the tab character, unless otherwise specified by the -d option.
If ‘-’ is specified for one or more of the input files, the standard input is used; standard input is read one line at a time,
circularly, for each instance of ‘-’.

On Red Hat Linux I just use echo:
echo $(cat /some/file/name)
This gives me all records of a file on just one line.

unix - print distinct list of control characters in a file

For example given an input file like below:
sid|storeNo|latitude|longitude
2|1|-28.03õ720000
9|2
10
jgn
352|1|-28.03¿720000
9|2|fd¿kjhn422-405
000¥0543210|gf¿djk39
gfd|f¥d||fd
Output (the characters below can appear in any order):
¿õ¥
Does anyone have a function (awk, bash, perl.etc) that could scan each line and then output (in octal, hex or ascii - either is fine) a distinct list of the control characters (for simplicity, control characters being those above ascii char 126) found?
Using perl v5.8.8.

To print the bytes in octal:
perl -ne'printf "%03o\n", ord for /[^\x09\x0A\x20-\x7E]/g' file | sort -u
To print the bytes in hex:
perl -ne'printf "%02X\n", ord for /[^\x09\x0A\x20-\x7E]/g' file | sort -u
To print the original bytes:
perl -nE'say for /[^\x09\x0A\x20-\x7E]/g' file | sort -u

This should catch everything over ordinal value 126 without having to explicitly weed out outliers
#!/bin/bash
while IFS= read -n1 c; do
    if (( $(printf "%d" "'$c") > 126 )); then
        echo "$c"
    fi
done < ./infile | sort -u
Output
¥
¿
õ

To delete everything except the control characters:
tr -d '\0-\176' < input > output
To test:
printf 'foobar\n\377' | tr -d '\0-\176' | od -t c
See tr(1) man page for details.
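To get the distinct list the question asks for, the tr output can be piped through od and sort. The following is a sketch on made-up sample data that prints each distinct byte above \176 in hex; multi-byte UTF-8 characters show up as several separate bytes.

```bash
sample=$(mktemp)
printf 'id|lat\n2|-28.03\277720\n9|fd\365kj\277n\n' > "$sample"

# Keep only bytes above \176, dump them as hex, one byte per line, then deduplicate.
distinct=$(LC_ALL=C tr -d '\000-\176' < "$sample" |
  od -An -v -tx1 |
  tr -s ' ' '\n' |
  grep . |
  sort -u)

printf '%s\n' "$distinct"
```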

sed -e 's/[A-Za-z0-9,|]//g' -e 's/-//g' -e 's/./&^M/g' | sort -u
Delete everything you don't want, put everything else on its own line, then sort -u the whole kit.
The "&^M" is "&" followed by Ctrl-V followed by Ctrl-M in Bash.
Unix wins.

How to join multiple lines of filenames into one with custom delimiter

How do I join the result of ls -1 into a single line and delimit it with whatever I want?

paste -s -d joins lines with a delimiter (e.g. ","), and does not leave a trailing delimiter:
ls -1 | paste -sd "," -

EDIT: Simply "ls -m" If you want your delimiter to be a comma
Ah, the power and simplicity !
ls -1 | tr '\n' ','
Change the comma "," to whatever you want. Note that this includes a "trailing comma" (for lists that end with a newline)

This replaces the last comma with a newline:
ls -1 | tr '\n' ',' | sed 's/,$/\n/'
ls -m includes newlines at the screen-width character (80th for example).
Mostly Bash (only ls is external):
saveIFS=$IFS; IFS=$'\n'
files=($(ls -1))
IFS=,
list=${files[*]}
IFS=$saveIFS
Using readarray (aka mapfile) in Bash 4:
readarray -t files < <(ls -1)
saveIFS=$IFS
IFS=,
list=${files[*]}
IFS=$saveIFS
Thanks to gniourf_gniourf for the suggestions.

I think this one is awesome
ls -1 | awk 'ORS=","'
ORS is the "output record separator" so now your lines will be joined with a comma.

Parsing ls in general is not advised, so alternative better way is to use find, for example:
find . -type f -print0 | tr '\0' ','
Or by using find and paste:
find . -type f | paste -d, -s
For general joining multiple lines (not related to file system), check: Concise and portable “join” on the Unix command-line.

The combination of setting IFS and use of "$*" can do what you want. I'm using a subshell so I don't interfere with this shell's $IFS
(set -- *; IFS=,; echo "$*")
To capture the output,
output=$(set -- *; IFS=,; echo "$*")
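The same idea can be wrapped in a small function so it works on any list of words, not only on the files matched by *. The name join_by is a hypothetical helper chosen for this sketch, not a builtin. Only the first character of IFS is used as the separator, so this approach is limited to single-character delimiters.

```bash
# join_by SEP WORD... : print the words joined by the single character SEP.
join_by() {
  local IFS=$1   # "$*" joins the positional parameters with the first character of IFS
  shift
  printf '%s\n' "$*"
}

join_by , alpha beta gamma
join_by '|' "two words" x
```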

Adding on top of majkinetor's answer, here is the way of removing trailing delimiter(since I cannot just comment under his answer yet):
ls -1 | awk 'ORS=","' | head -c -1
Just remove as many trailing bytes as your delimiter counts for.
I like this approach because I can use multi character delimiters + other benefits of awk:
ls -1 | awk 'ORS=", "' | head -c -2
EDIT
As Peter has noticed, negative byte count is not supported in native MacOS version of head. This however can be easily fixed.
First, install coreutils. "The GNU Core Utilities are the basic file, shell and text manipulation utilities of the GNU operating system."
brew install coreutils
Commands also provided by MacOS are installed with the prefix "g". For example gls.
Once you have done this you can use ghead which has negative byte count, or better, make alias:
alias head="ghead"
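If installing coreutils is not an option, one possible alternative is to capture the output in a variable and strip the trailing delimiter with shell parameter expansion. A sketch, with printf standing in for ls so the input is predictable:

```bash
# Build the delimited string with awk as above; the result ends with an extra ", ".
list=$(printf '%s\n' alpha beta gamma | awk 'ORS=", "')

# Remove one trailing ", " with parameter expansion; no head or coreutils needed.
list=${list%, }
printf '%s\n' "$list"
```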

Don't reinvent the wheel.
ls -m
It does exactly that.

just bash
mystring=$(printf "%s|" *)
echo ${mystring%|}

This command is for the PERL fans :
ls -1 | perl -l40pe0
Here 40 is the octal ascii code for space.
-p will process line by line and print
-l will take care of replacing the trailing \n with the ascii character we provide.
-e is to inform PERL we are doing command line execution.
0 means that there is actually no command to execute.
perl -e0 is same as perl -e ' '

To avoid potential newline confusion for tr we could add the -b flag to ls:
ls -1b | tr '\n' ';'

It looks like the answers already exist.
If you want
a, b, c format, use ls -m ( Tulains Córdova’s answer)
Or if you want a b c format, use ls | xargs (simplified version of Chris J’s answer)
Or if you want any other delimiter like |, use ls | paste -sd'|' (application of Artem’s answer)

The sed way,
sed -e ':a; N; $!ba; s/\n/,/g'
# :a # label called 'a'
# N # append next line into Pattern Space (see info sed)
# $!ba # if this is not the last line ($!), branch (b) back to label a; on the last line, fall through and end the loop
# s/\n/,/g # any substitution you want
Note:
This is linear in complexity, substituting only once after all lines are appended into sed's Pattern Space.
@AnandRajaseka's answer, and some other similar answers, such as here, are O(n²), because sed has to perform a substitution every time a new line is appended into the Pattern Space.
To compare,
seq 1 100000 | sed ':a; N; $!ba; s/\n/,/g' | head -c 80
# linear, in less than 0.1s
seq 1 100000 | sed ':a; /$/N; s/\n/,/; ta' | head -c 80
# quadratic, hung

sed -e :a -e '/$/N; s/\n/\\n/; ta' [filename]
Explanation:
-e - denotes a command to be executed
:a - is a label
/$/N - defines the scope of the match for the current and the (N)ext line
s/\n/\\n/; - replaces all EOL with \n
ta; - goto label a if the match is successful
Taken from my blog.

If your version of xargs supports the -d flag then this should work
ls | xargs -d, -L 1 echo
-d is the delimiter flag
If you do not have -d, then you can try the following
ls | xargs -I {} echo {}, | xargs echo
The first xargs allows you to specify your delimiter which is a comma in this example.

ls produces one column output when connected to a pipe, so the -1 is redundant.
Here's another perl answer using the builtin join function which doesn't leave a trailing delimiter:
ls | perl -F'\n' -0777 -anE 'say join ",", @F'
The obscure -0777 makes perl read all the input before running the program.
sed alternative that doesn't leave a trailing delimiter
ls | sed '$!s/$/,/' | tr -d '\n'

The Python answer is interesting, but the language itself can make the output even nicer:
ls -1 | python -c "import sys; print(sys.stdin.read().splitlines())"

You can use:
ls -1 | perl -pe 's/\n$/some_delimiter/'

If Python3 is your cup of tea, you can do this (but please explain why you would?):
ls -1 | python -c "import sys; print(','.join(sys.stdin.read().splitlines()))"

ls has the option -m to delimit the output with ", " a comma and a space.
ls -m | tr -d ' ' | tr ',' ';'
piping this result to tr to remove either the space or the comma will allow you to pipe the result again to tr to replace the delimiter.
in my example i replace the delimiter , with the delimiter ;
replace ; with whatever one character delimiter you prefer since tr only accounts for the first character in the strings you pass in as arguments.

You can use chomp to merge multiple line in single line:
perl -e 'while (<>) { if (/\$/ ) { chomp; } print ;}' bad0 >test
Put the line-break condition in the if statement. It can be a special character or any delimiter.

Quick Perl version with trailing slash handling:
ls -1 | perl -E 'say join ", ", map {chomp; $_} <>'
To explain:
perl -E: run Perl with optional features enabled (say, ...)
say: print followed by a newline
join ", ", ARRAY_HERE: join an array with ", "
map {chomp; $_} ROWS: strip the trailing newline from each line and return the result
<>: stdin, each line is a ROW, coupling with a map it will create an array of each ROW
