For example, given an input file like the one below:
sid|storeNo|latitude|longitude
2|1|-28.03õ720000
9|2
10
jgn
352|1|-28.03¿720000
9|2|fd¿kjhn422-405
000¥0543210|gf¿djk39
gfd|f¥d||fd
Output (the characters below can appear in any order):
¿õ¥
Does anyone have a function (awk, bash, perl, etc.) that could scan each line and then output (in octal, hex, or ASCII - either is fine) a distinct list of the control characters (for simplicity, control characters being those above ASCII char 126) found?
Using Perl v5.8.8.
To print the bytes in octal:
perl -ne'printf "%03o\n", ord for /[^\x09\x0A\x20-\x7E]/g' file | sort -u
To print the bytes in hex:
perl -ne'printf "%02X\n", ord for /[^\x09\x0A\x20-\x7E]/g' file | sort -u
To print the original bytes (note that -E and say require Perl 5.10+):
perl -nE'say for /[^\x09\x0A\x20-\x7E]/g' file | sort -u
This should catch everything over ordinal value 126 without having to explicitly weed out outliers.
#!/bin/bash
# Read one character at a time; print any whose ordinal value exceeds 126.
while IFS= read -r -n1 c; do
    if (( $(printf '%d' "'$c") > 126 )); then
        echo "$c"
    fi
done < ./infile | sort -u
Output
¥
¿
õ
To delete everything except the control characters:
tr -d '\0-\176' < input > output
To test:
printf 'foobar\n\377' | tr -d '\0-\176' | od -t c
See tr(1) man page for details.
sed -e 's/[A-Za-z0-9,|]//g' -e 's/-//g' -e 's/./&^M/g' | sort -u
Delete everything you don't want, put everything else on its own line, then sort -u the whole kit.
The "&^M" is "&" followed by Ctrl-V followed by Ctrl-M in Bash.
Unix wins.
Related
I have a file consisting of multiple rows like this:
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN|0000000010000.00|6761857316|508998|6011|GL
I have to split column 11 into 4 different columns based on character counts.
This is the 11th column; it also contains extra spaces:
SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN
This is what I have done:
ls *.txt *.TXT | while read line
do
    subName="$(cut -d'.' -f1 <<<"$line")"
    awk -F"|" '{ "echo -n "$11" | cut -c1-23" | getline ton;
                 "echo -n "$11" | cut -c24-36" | getline city;
                 "echo -n "$11" | cut -c37-38" | getline state;
                 "echo -n "$11" | cut -c39-40" | getline country;
                 $11=ton"|"city"|"state"|"country; print $0
    }' OFS="|" $line > $subName$output
done
But while echoing the 11th column, it's trimming the extra spaces, which leads to a mismatch in the character count. Is there any way to echo without trimming spaces?
Actual output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR MHIN|||0000000010000.00|6761857316|508998|6011|GL
Expected Output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR|MH|IN|0000000010000.00|6761857316|508998|6011|GL
The least annoying way to code this that I've found so far is:
perl -F'\|' -lane '$F[10] = join "|", unpack "a23 A13 a2 a2", $F[10]; print join "|", @F'
It's fairly straightforward:
Iterate over lines of input; split each line on | and put the fields in @F.
For the 11th field ($F[10]), split it into fixed-width subfields using unpack, trimming trailing spaces from the second field (A instead of a); see the short demo after this list.
Reassemble subfields by joining with |.
Reassemble the whole line by joining with | and printing it.
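A quick illustration of the a/A difference in unpack, using a made-up two-field string (5 bytes each, purely for the demo):
$ perl -E 'say join "|", unpack "a5 A5", "ab   cd   "'
ab   |cd
The a field keeps its trailing spaces; the A field has them trimmed.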
I haven't benchmarked it in any way, but it's likely much faster than the original code, which spawns multiple shell and cut processes per input line, because it's all done in one process.
A complete solution would wrap it in a shell loop:
for file in *.txt *.TXT; do
outfile="${file%.*}$output"
perl -F'\|' -lane '...' "$file" > "$outfile"
done
Or if you don't need to trim the .txt part (and the file list fits on the command line):
perl -i.out -F'\|' -lane '...' *.txt *.TXT
This simply places the output for each input file foo.txt in foo.txt.out.
A pure-bash implementation of all this logic:
#!/usr/bin/env bash
shopt -s nocaseglob extglob
for f in *.txt; do
subName=${f%.*}
while IFS='|' read -r -a fields; do
location=${fields[10]}
ton=${location:0:23}; ton=${ton%%+([[:space:]])}
city=${location:23:12}; city=${city%%+([[:space:]])}
state=${location:36:2}
country=${location:38:2}
fields[10]="$ton|$city|$state|$country"
printf -v out '%s|' "${fields[@]}"
printf '%s\n' "${out:0:$(( ${#out} - 1 ))}"
done <"$f" >"$subName.out"
done
It's slower (if I did this well, by about a factor of 10) than pure awk would be, but much faster than the awk/shell combination proposed in the question.
Going into the constructs used:
All the ${varname%...} and related constructs are parameter expansion. The specific ${varname%pattern} construct removes the shortest possible match for pattern from the end of the value in varname, or the longest match if % is replaced with %%.
Using extglob enables extended globbing syntax, such as +([[:space:]]), which is equivalent to the regex syntax [[:space:]]+.
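A tiny demo of the two constructs together (variable and value made up for illustration):
shopt -s extglob
s='CHEMBUR      '
echo "[${s%%+([[:space:]])}]"    # prints [CHEMBUR] - the run of trailing spaces is removed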
I want to delete all the control characters from my file using linux bash commands.
There are some control characters, like EOF (0x1A) especially, which cause problems when I load my file into another piece of software. I want to delete them.
Here is what I have tried so far:
This will list all the control characters:
cat -v -e -t file.txt | head -n 10
^A+^X$
^A1^X$
^D ^_$
^E-^D$
^E-^S$
^E1^V$
^F%^_$
^F-^D$
^F.^_$
^F/^_$
^F4EZ$
^G%$
This will list all the control characters using grep:
$ cat file.txt | head -n 10 | grep '[[:cntrl:]]'
+
1
-
-
1
%
-
.
/
This matches the output of the cat command above.
Now, I ran the following command to show all lines not containing control characters, but it is still showing the same output as above (lines with control characters):
$ cat file.txt | head -n 10 | grep '[^[:cntrl:]]'
+
1
-
-
1
%
-
.
/
Here is the output in hex format:
$ cat file.txt | head -n 10 | grep '[[:cntrl:]]' | od -t x2
0000000 2b01 0a18 3101 0a18 2004 0a1f 2d05 0a04
0000020 2d05 0a13 3105 0a16 2506 0a1f 2d06 0a04
0000040 2e06 0a1f 2f06 0a1f
0000050
As you can see, the hex values 0x01 and 0x18 are control characters.
I tried using the tr command to delete the control characters but got an error:
$ cat file.txt | tr -d "\r\n" "[:cntrl:]" >> test.txt
tr: extra operand `[:cntrl:]'
Only one string may be given when deleting without squeezing repeats.
Try `tr --help' for more information.
If I delete all control characters, I will end up deleting the newline and carriage return as well, which are used as the line-ending sequence on Windows. How do I delete all the control characters, keeping only the required ones like \r\n?
Thanks.
Instead of using the predefined [:cntrl:] set, which as you observed includes \n and \r, just list (in octal) the control characters you want to get rid of:
$ tr -d '\000-\011\013\014\016-\037' < file.txt > newfile.txt
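A quick sanity check, with a few sample bytes chosen for illustration:
$ printf 'keep\r\n\001\032gone\n' | tr -d '\000-\011\013\014\016-\037' | od -c
0000000   k   e   e   p  \r  \n   g   o   n   e  \n
0000013
The 0x01 and 0x1A bytes are gone, while \r and \n survive.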
Based on this answer on unix.stackexchange, this should do the trick:
$ cat scriptfile.raw | col -b > scriptfile.clean
Try grep, like:
grep -o "[[:print:][:space:]]*" in.txt > out.txt
which will print only printable characters ([:print:]) and whitespace characters ([:space:]), i.e. everything except the remaining control characters; [:space:] covers tab, newline, vertical tab, form feed, carriage return, and space.
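One caveat worth checking against your data: grep -o prints each match on its own line, so a control character in the middle of a line splits that line in the output (sample input made up here):
$ printf 'foo\001bar\n' | grep -o "[[:print:][:space:]]*"
foo
bar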
To be less restrictive, and remove only control characters ([:cntrl:]), delete them by:
tr -d "[:cntrl:]"
If you want to keep \n (which is part of [:cntrl:]), then replace it temporarily with something else, e.g.
cat file.txt | tr '\r\n' '\275\276' | tr -d "[:cntrl:]" | tr "\275\276" "\r\n"
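To verify the round trip (sample bytes made up for the test; \275 and \276 are simply two bytes unlikely to appear in a text file):
$ printf 'ab\001cd\r\n' | tr '\r\n' '\275\276' | tr -d '[:cntrl:]' | tr '\275\276' '\r\n' | od -c
0000000   a   b   c   d  \r  \n
0000006
The 0x01 byte is deleted while the \r\n pair comes back unchanged.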
A little late to the party: cat -v <file>
which I think is the easiest to remember of the lot!
I have a binary file and want to extract part of it, starting from a known byte string (e.g. FF D8 FF D0) and ending with another known byte string (AF FF D9).
In the past I've used dd to cut a piece off the beginning or end of a binary file, but that command doesn't seem to support what I'm asking here.
What terminal tool can do this?
Locate the start/end position, then extract the range.
$ xxd -g0 input.bin | grep -im1 FFD8FFD0 | awk -F: '{print $1}'
0000cb0
$ ^FFD8FFD0^AFFFD9^
0009590
(^old^new^ is shell history substitution: it reruns the previous command with FFD8FFD0 replaced by AFFFD9.)
$ dd ibs=1 count=$((0x9590-0xcb0+1)) skip=$((0xcb0)) if=input.bin of=output.bin
In a single pipe:
xxd -c1 -p file |
awk -v b="ffd8ffd0" -v e="afffd9" '
found == 1 {
print $0
str = str $0
if (str == e) {found = 0; exit}
if (length(str) == length(e)) str = substr(str, 3)}
found == 0 {
str = str $0
if (str == b) {found = 1; print str; str = ""}
if (length(str) == length(b)) str = substr(str, 3)}
END{ exit found }' |
xxd -r -p > new_file
test ${PIPESTATUS[1]} -eq 0 || rm new_file
The idea is to use awk between the two xxd calls to select the part of the file that is needed. Once the 1st pattern is found, awk prints the bytes until the 2nd pattern is found, then exits.
The case where the 1st pattern is found but the 2nd is not must be taken into account. It is handled in the END block of the awk script, which returns a non-zero exit status. This is caught by bash's ${PIPESTATUS[1]}, in which case I decided to delete the new file.
Note that an empty output file also means that nothing was found.
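If ${PIPESTATUS[@]} is unfamiliar: it is a bash array holding the exit status of each command in the most recently executed pipeline, e.g.
$ true | false | true; echo "${PIPESTATUS[@]}"
0 1 0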
This should work with standard tools (xxd, tr, grep, awk, dd). It correctly handles the "pattern split across lines" issue, and looks for the pattern only when aligned at a byte offset (not at a nibble).
file=<yourfile>
outfile=<youroutputfile>
startpattern="ff d8 ff d0"
endpattern="af ff d9"
xxd -g0 -c1 -ps ${file} | tr '\n' ' ' > ${file}.hex
start=$((($(grep -bo "${startpattern}" ${file}.hex\
| head -1 | awk -F: '{print $1}')-1)/3))
len=$((($(grep -bo "${endpattern}" ${file}.hex\
| head -1 | awk -F: '{print $1}')-1)/3-${start}))
dd ibs=1 count=${len} skip=${start} if=${file} of=${outfile}
Note: the script above uses a temporary file to avoid doing the binary-to-hex conversion twice. A space/time trade-off is to pipe the result of xxd directly into the two greps. A one-liner is also possible, at the expense of clarity.
One could also use tee and named pipes to avoid storing a temporary file and converting the output twice, but I'm not sure it would be faster (xxd is fast), and it is certainly more complex to write.
See this link for a way to do binary grep. Once you have the start and end offset, you should be able with dd to get what you need.
A variation on the awk solution, assuming that your binary file, once converted to hex with spaces, fits in memory:
xxd -c1 -p file |
tr "\n" " " |
sed -n -e 's/.*\(ff d8 ff d0.*af ff d9\).*/\1/p' |
xxd -r -p > new_file
Another solution in sed, but using less memory:
xxd -c1 -p file |
sed -n -e '1{N;N;N}' -e '/ff\nd8\nff\nd0/{:begin;p;s/.*//;n;bbegin}' -e 'N;D' |
sed -n -e '1{N;N}' -e '/af\nff\nd9/{p;Q1}' -e 'P;N;D' |
xxd -r -p > new_file
test ${PIPESTATUS[2]} -eq 1 || rm new_file
The 1st sed prints from ff d8 ff d0 to the end of the file. Note that you need as many N commands in -e '1{N;N;N}' as there are bytes in your 1st pattern, less one.
The 2nd sed prints from the beginning of the file to af ff d9. Again, you need as many N commands in -e '1{N;N}' as there are bytes in your 2nd pattern, less one.
Again, a test is needed to check if the 2nd pattern is found, and delete the file if it is not.
Note that the Q command is a GNU extension to sed. If you do not have it, you need to discard the rest of the file once the pattern is found (in a loop like the 1st sed, but without printing), and check after the hex-to-binary conversion that new_file ends with the right pattern.
How do I join the result of ls -1 into a single line and delimit it with whatever I want?
paste -s -d joins lines with a delimiter (e.g. ","), and does not leave a trailing delimiter:
ls -1 | paste -sd "," -
EDIT: simply use "ls -m" if you want your delimiter to be a comma.
Ah, the power and simplicity!
ls -1 | tr '\n' ','
Change the comma "," to whatever you want. Note that this includes a "trailing comma" (for lists that end with a newline).
This replaces the last comma with a newline:
ls -1 | tr '\n' ',' | sed 's/,$/\n/'
ls -m includes newlines at the screen-width character (80th for example).
Mostly Bash (only ls is external):
saveIFS=$IFS; IFS=$'\n'
files=($(ls -1))
IFS=,
list=${files[*]}
IFS=$saveIFS
Using readarray (aka mapfile) in Bash 4:
readarray -t files < <(ls -1)
saveIFS=$IFS
IFS=,
list=${files[*]}
IFS=$saveIFS
Thanks to gniourf_gniourf for the suggestions.
I think this one is awesome
ls -1 | awk 'ORS=","'
ORS is the "output record separator" so now your lines will be joined with a comma.
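For example (input made up; note the trailing comma, the same caveat as with the tr approach above):
$ printf 'a\nb\nc\n' | awk 'ORS=","'
a,b,c,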
Parsing ls is generally not advised, so a better alternative is to use find, for example:
find . -type f -print0 | tr '\0' ','
Or by using find and paste:
find . -type f | paste -d, -s
For general joining multiple lines (not related to file system), check: Concise and portable “join” on the Unix command-line.
The combination of setting IFS and use of "$*" can do what you want. I'm using a subshell so I don't interfere with this shell's $IFS
(set -- *; IFS=,; echo "$*")
To capture the output,
output=$(set -- *; IFS=,; echo "$*")
Adding on top of majkinetor's answer, here is a way of removing the trailing delimiter (since I cannot comment under his answer yet):
ls -1 | awk 'ORS=","' | head -c -1
Just remove as many trailing bytes as your delimiter counts for.
I like this approach because I can use multi character delimiters + other benefits of awk:
ls -1 | awk 'ORS=", "' | head -c -2
EDIT
As Peter has noticed, a negative byte count is not supported by the native macOS version of head. This, however, can be easily fixed.
First, install coreutils. "The GNU Core Utilities are the basic file, shell and text manipulation utilities of the GNU operating system."
brew install coreutils
Commands that are also provided by macOS are installed with the prefix "g", for example gls.
Once you have done this, you can use ghead, which supports a negative byte count, or better, make an alias:
alias head="ghead"
Don't reinvent the wheel.
ls -m
It does exactly that.
Just bash:
mystring=$(printf "%s|" *)
echo "${mystring%|}"
printf reuses its format string for each remaining argument, so every filename gets a trailing |, and ${mystring%|} then strips the final one.
This command is for the Perl fans:
ls -1 | perl -l40pe0
Here 40 is the octal ASCII code for a space.
-p will process the input line by line and print it.
-l will take care of replacing the trailing \n with the ASCII character we provide.
-e is to inform Perl we are doing a command-line execution.
0 means that there is actually no command to execute; perl -e0 is the same as perl -e ' ' (see the quick check below).
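A quick check of what it produces (the result ends with a trailing space, since every line ends with the chosen separator):
$ printf 'a\nb\nc\n' | perl -l40pe0
a b c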
To avoid potential newline confusion for tr we could add the -b flag to ls:
ls -1b | tr '\n' ';'
It looks like the answers already exist.
If you want a, b, c format, use ls -m (Tulains Córdova’s answer).
Or if you want a b c format, use ls | xargs (simplified version of Chris J’s answer).
Or if you want any other delimiter like |, use ls | paste -sd'|' (application of Artem’s answer).
The sed way:
sed -e ':a; N; $!ba; s/\n/,/g'
# :a # label called 'a'
# N # append next line into Pattern Space (see info sed)
# $!ba # if it's the last line ($) do not (!) jump to (b) label :a (a) - break loop
# s/\n/,/g # any substitution you want
Note:
This is linear in complexity, substituting only once after all lines are appended into sed's Pattern Space.
@AnandRajaseka's answer, and some other similar answers, such as here, are O(n²), because sed has to do the substitution every time a new line is appended into the Pattern Space.
To compare,
seq 1 100000 | sed ':a; N; $!ba; s/\n/,/g' | head -c 80
# linear, in less than 0.1s
seq 1 100000 | sed ':a; /$/N; s/\n/,/; ta' | head -c 80
# quadratic, hung
sed -e :a -e '/$/N; s/\n/\\n/; ta' [filename]
Explanation:
-e - denotes a command to be executed
:a - is a label
/$/N - defines the scope of the match for the current and the (N)ext line
s/\n/\\n/; - replaces all EOL with \n
ta; - goto label a if the match is successful
Taken from my blog.
If your version of xargs supports the -d flag, then this should work:
ls | xargs -d, -L 1 echo
-d is the delimiter flag
If you do not have -d, then you can try the following
ls | xargs -I {} echo {}, | xargs echo
The first xargs allows you to specify your delimiter which is a comma in this example.
ls produces one-column output when connected to a pipe, so the -1 is redundant.
Here's another perl answer using the builtin join function which doesn't leave a trailing delimiter:
ls | perl -F'\n' -0777 -anE 'say join ",", @F'
The obscure -0777 makes perl read all the input before running the program.
A sed alternative that doesn't leave a trailing delimiter:
ls | sed '$!s/$/,/' | tr -d '\n'
The Python answer above is interesting, but the language itself can even make the output nice:
ls -1 | python -c "import sys; print(sys.stdin.read().splitlines())"
You can use:
ls -1 | perl -pe 's/\n$/some_delimiter/'
If Python 3 is your cup of tea, you can do this (but please explain why you would?):
ls -1 | python -c "import sys; print(','.join(sys.stdin.read().splitlines()))"
ls has the option -m to delimit the output with ", ", a comma and a space.
ls -m | tr -d ' ' | tr ',' ';'
Piping this result to tr to remove either the space or the comma allows you to pipe the result again to tr to replace the delimiter.
In my example I replace the delimiter , with the delimiter ;.
Replace ; with whatever one-character delimiter you prefer; tr only accounts for the first character of the strings you pass in as arguments.
You can use chomp to merge multiple lines into a single line:
perl -e 'while (<>) { if (/\$/) { chomp; } print; }' bad0 >test
Put the line-break condition in the if statement. It can be a special character or any delimiter.
Quick Perl version with trailing newline handling:
ls -1 | perl -E 'say join ", ", map {chomp; $_} <>'
To explain:
perl -E: execute Perl with new-feature support enabled (say, ...)
say: print with a trailing newline
join ", ", ARRAY_HERE: join an array with ", "
map {chomp; $_} ROWS: remove the trailing newline from each line and return the result
<>: stdin; each line is a ROW, and coupled with map it creates an array of the ROWs