Find files with non-printing characters (null bytes) - linux

I have got the log of my application with a field that contains strange characters.
I see these characters only when I use less command.
I tried to copy the result of my line of code in a text file and what I see is
CTP_OUT=^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
I'd like to know if there is a way to find these null characters. I have tried with a grep command but it didn't show anything

I hardly believe it, I might write an answer involving cat!
The characters you are observing are non-printable characters, which are often written in caret notation. Caret notation is a way to visualize non-printable characters. As mentioned in the OP, ^@ is the representation of NUL.
If your file has non-printable characters, you can visualize them using cat -vET:
-E, --show-ends: display $ at end of each line
-T, --show-tabs: display TAB characters as ^I
-v, --show-nonprinting: use ^ and M- notation, except for LFD and TAB
source: man cat
I've added the -E and -T flags so that line ends and tabs are made visible as well.
grep itself does not convert non-printable characters into a visible form, so you have to pipe its output to cat to see them.
Show all lines with non-printable characters:
$ grep -E '[^[:print:]]' --color=never file | cat -vET
Here, the ERE [^[:print:]] selects all non-printable characters.
Show all lines with NULL:
$ grep -Pa '\x00' --color=never file | cat -vET
Be aware that we need to make use of the Perl regular expressions here as they understand the hexadecimal and octal notation.
Various control characters can be written in C language style: \n matches a newline, \t a tab, \r a carriage return, \f a form feed, etc.
More generally, \nnn, where nnn is a string of three octal digits, matches the character whose native code point is nnn. You can easily run into trouble if you don't have exactly three digits. So always use three, or since Perl 5.14, you can use \o{...} to specify any number of octal digits.
Similarly, \xnn, where nn are hexadecimal digits, matches the character whose native ordinal is nn. Again, not using exactly two digits is a recipe for disaster, but you can use \x{...} to specify any number of hex digits.
source: Perl 5 version 26.1 documentation
An example:
$ printf 'foo\012\011\011bar\014\010\012foobar\012\011\000\013\000car\012\011\011\011\012' > test.txt
$ cat test.txt
foo
bar
foobar
car
If we now use grep alone, we get the following:
$ grep -Pa '\x00' --color=never test.txt
car
But piping it to cat allows us to visualize the control characters:
$ grep -Pa '\x00' --color=never test.txt | cat -vET
^I^@^K^@car$
Why --color=never: if your grep is configured to use --color=auto or --color=always, it may add extra escape sequences, which the terminal interprets as color. In the cat -v output these show up as additional control characters and can make the content confusing.
$ grep -Pa '\x00' --color=always test.txt | cat -vET
^I^[[01;31m^[[K^@^[[m^[[K^K^[[01;31m^[[K^@^[[m^[[Kcar$
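To answer the title question directly (which files contain NUL bytes, rather than which lines), the same pattern can be combined with grep's -r and -l options. This is only a minimal sketch, assuming GNU grep built with PCRE support; the function name list_nul_files is made up for illustration and is not part of the answer above.

```shell
# list_nul_files DIR: print the name of every file under DIR that
# contains at least one NUL byte.
#   -r  recurse into DIR
#   -l  print only the names of files with a match
#   -a  treat binary files as text, so the pattern is applied to them
#   -P  Perl-compatible regex, needed for the \x00 escape
list_nul_files() {
    grep -rlaP '\x00' -- "$1"
}
```

It would be called as list_nul_files followed by the directory that holds the logs.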

sed can.
sed -n '/\x0/ { s/\x0/<NUL>/g; p}' file
-n skips printing any output unless explicitly requested.
/\x0/ selects for only lines with null bytes.
{...} encapsulates multiple commands, so that they can be collectively applied always and only when the /\x0/ has detected a null on the line.
s/\x0/<NUL>/g; substitutes in a new, visible value for the null bytes. You could make it whatever you want - I used <NUL> as something both reasonably obvious and yet unlikely to occur otherwise. You should probably grep the file for it first to be sure the pattern doesn't exist before using it.
p; causes lines that have been edited (because they had a null byte) to show.
This basically makes sed an effective grep for nulls.
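A small self-contained check of the command above, assuming GNU sed (the \x escape is a GNU extension). The wrapper name show_nul_lines is hypothetical, and the escape is written with two hex digits as \x00.

```shell
# show_nul_lines [FILE...]: print only the lines that contain a NUL byte,
# with each NUL replaced by the visible marker <NUL>.
# With no FILE argument, sed reads standard input.
show_nul_lines() {
    sed -n '/\x00/ { s/\x00/<NUL>/g; p }' "$@"
}

# Two lines of input, only the second contains a NUL byte.
printf 'clean\nbad\000line\n' | show_nul_lines   # prints: bad<NUL>line
```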

Related

sed doesn't remove characters from UTF range properly

I want to strip my file of all characters except Russian and Arabic letters, "|", and the space character. Let's start with only Arabic letters. So I have:
cat file.tzt | sed 's/[^\u0600-\u06FF]//g'
sed: -e expression #1, char 21: Invalid range end.
I have tried [\u0621-\u064A] - same.
I also tried to use {Arabic}, but it doesn't clean files properly at all.
Error looks kinda strange for me. Obviously, 064FF > 0621.
So, overall I want to have something like this:
cat file.tzt | sed 's/[^\u0600-\u06FFа-яА-Я |]//g'
I am OK with awk or any other utility, but as far as I know sed is stable and reliable.
Perl understands UTF-8:
perl -CSD -pe 's/[^\N{U+0600}-\N{U+06FF}]//g' -- file.txt
-C turns on UTF-8 support; S applies it to stdin/stdout/stderr, and D makes UTF-8 the default layer for other input and output streams.
You can also use Unicode properties:
s/\P{Cyrillic}//g

Sed is not writing to file

I simply want to change the delimiter of my CSV file.
The file comes from an outside server, so the delimiter is something like this: ^A.
name^Atype^Avalue^A
john^Ab^A500
mary^Ac^A400
jack^Ad^A200
I want to get this:
name,type,value
john,b,500
mary,c,400
jack,d,200
I need to change it to a comma (,) or a tab (\t), but my sed command, despite producing the correct output, does not write to the file.
cat -v CSVFILE | sed -i "s/\^A/,/g"
When I use the line above, it correctly outputs the file delimited by commas instead of ^A, but it doesn't write to the file.
I also tried this:
sed -i "s/\^A/,/g" CSVFILE
That does not work either.
What am I doing wrong?
Literal ^A (two characters, ^ and A) is how cat -v visualizes control character 0x1 (ASCII code 1, named SOH (start of heading)). ^A is an example of caret notation to represent unprintable ASCII characters:
^A stands for the keyboard combination Control-A, which, when preceded by the generic escape sequence Control-V, is how you can create the actual control character in your terminal; in other words,
Control-V Control-A will insert an actual 0x1 character.
Incidentally, the logic of caret notation (^<letter>) is: the letter corresponds to the ASCII value of the control character represented; e.g., A corresponds to 0x1, and D corresponds to 0x4 (^D, EOT).
To put it differently: you add 0x40 to the ASCII value of the control character to get the ASCII value of its letter representation in caret notation.
^@ to represent NUL (0x0 characters) and ^? to represent DEL (0x7f) are consistent with this notation, because @ has ASCII value 0x40 (i.e., it comes just before A (0x41) in the ASCII table) and 0x40 + 0x7f constrained to 7 bits (bit-ANDed with the max. ASCII value 0x7f) yields 0x3f, which is the ASCII value of ?.
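The arithmetic can be checked with a few lines of shell. This is just an illustrative sketch; the function name caret is made up.

```shell
# caret N: print the caret notation for control character N
# (N is 0..31 or 127).
caret() {
    code=$(( ($1 + 0x40) & 0x7f ))          # add 0x40, keep the low 7 bits
    printf "^\\$(printf '%03o' "$code")\n"   # emit that code as an octal escape
}

caret 0     # ^@  (NUL)
caret 1     # ^A  (SOH)
caret 4     # ^D  (EOT)
caret 127   # ^?  (DEL)
```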
To inspect a given file for the ASCII values of exotic control characters, you can pipe it to od -c, which represents 0x1 as (octal) 001.
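For example, with a made-up record that uses 0x1 as the delimiter (sample_record is a hypothetical helper, standing in for the real file):

```shell
# Build one sample record with 0x1 (SOH) as the delimiter and dump it.
# od -c prints printable bytes as themselves, common escapes such as \n
# symbolically, and other bytes as three-digit octal numbers.
sample_record() {
    printf 'john\001b\001500\n'
}

sample_record | od -c
```

In the dump, each delimiter shows up as 001 between the field characters, rather than as the two characters ^ and A.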
This implies that, when passing the file to sed directly, you cannot use caret notation and must instead use the actual control character in your s call.
Note that when you use Control-V Control-A to create an actual 0x1 character, it will also appear in caret notation - as ^A - but in that case it is just the terminal's visualization of the true control character; while it may look like the two printable characters ^ and A, it is not. Purely visually you cannot tell the difference - which is why using an escape sequence or ANSI C-quoted string to represent the control character is the better choice - see below.
Assuming your shell is bash, ksh, or zsh, the better alternative to using Control-V Control-A is to use an ANSI C-quoted string to generate the 0x1 character: $'\1'
However, as Lars Fischer points out in a comment on the question, GNU sed also recognizes escape sequence \x01 for 0x1.
Thus, your command should be:
sed -i 's/\x01/,/g' CSVFILE # \x01 only recognized by GNU sed
or, using an ANSI C-quoted string:
sed -i $'s/\1/,/g' CSVFILE
Note: While this form can in principle be used with BSD/OSX sed, the -i syntax is slightly different: you'd have to use sed -i '' $'s/\1/,/g' CSVFILE
The only reason to use sed for your task is to take advantage of in-place updating (-i); otherwise, tr is the better choice - see Ed Morton's answer.
This is the job tr was created to do:
tr '<control-A>' ',' < file > tmp && mv tmp file
Replace <control-A> with a literal control-A obviously.
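If typing a literal control-A is inconvenient, tr also accepts the octal escape '\001' for the same byte. A minimal sketch of the same idea wrapped in a function; the name soh_to_comma and the use of mktemp for the temporary file are my own choices, not part of the answer above.

```shell
# soh_to_comma FILE: replace every 0x1 byte in FILE with a comma.
# tr cannot edit in place, so write to a temporary file and rename it
# over the original only if tr succeeded.
soh_to_comma() {
    tmp=$(mktemp) || return 1
    tr '\001' ',' < "$1" > "$tmp" && mv -- "$tmp" "$1"
}
```

Note that the replaced file takes on the ownership and permissions of the temporary file, so check those afterwards if they matter.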
If your sed supports the -i option, you could use it like this:
sed -i.bak -e "s/\^A/,/g" CSVFILE
(This assumes the delimiter in the source file consists of the two characters ^ and A; if ^A is supposed to refer to Control-A, then you will have to make adjustments accordingly, e.g. using 's/\x01/,/g'.)
Otherwise, assuming you want to keep a copy of the original file (e.g. in case the result is not what you expect -- see below), an incantation such as the following can be used:
mv CSVFILE CSVFILE.bak && sed "s/\^A/,/g" CSVFILE.bak > CSVFILE
As pointed out elsewhere, if the source-file separator is Control-A, you could also use tr '\001' , (or tr '\001' '\t' for a tab).
The caution is that the delimiter in the source file might well be used precisely because commas might appear in the "values" that the separator-character is separating. If that is a possibility, then a different approach will be needed. (See e.g. https://www.rfc-editor.org/rfc/rfc4180)
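If commas can occur in the values, one possible approach is to quote such fields the way RFC 4180 describes: enclose the field in double quotes and double any embedded double quote. A rough sketch with awk, assuming the separator is Control-A and that no field contains a line break; the function name soh_to_quoted_csv is made up.

```shell
# soh_to_quoted_csv [FILE...]: convert Control-A separated records to CSV,
# quoting any field that contains a comma or a double quote.
soh_to_quoted_csv() {
    awk 'BEGIN { FS = "\001"; OFS = "," }
    {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /[",]/) {
                gsub(/"/, "\"\"", $i)   # double embedded quotes
                $i = "\"" $i "\""       # wrap the field in quotes
            }
        }
        $1 = $1                         # force the record to be rebuilt with OFS
        print
    }' "$@"
}
```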
If you are running this on OS X:
Add an extension to -i to keep a backup copy of the original file:
sed -i.bak "s/^A/,/g" CSVFILE
Or, to edit in place without a backup:
sed -i '' "s/^A/,/g" CSVFILE
You can also write the result to a new file by redirecting the output, without using -i:
sed "s/^A/,/g" CSVFILE > output
Make sure you type the ^A as an actual control character:
Ctrl+V, then Ctrl+A

Line numbering in Grep

I have this command that uses grep:
cat nastava.html | grep '<td>[A-Z a-z]*</td><td>[0-9/]*</td>' | sed 's/[ \t]*<td>\([A-Z a-z]*\)<\/td><td>\([0-9]\{1,3\}\)\/[0-9]\{2\}\([0-9]\{2\}\)<\/td>.*/\1 mi\3\2 /'
|sort|grep -n ".*" | sed -r 's/(.*):(.*)/\1. \2/' >studenti.txt
I don't understand the second line. sort is OK, and grep -n numbers the sorted list, but why do we use ".*" here? It won't work without it, and I don't understand why.
The grep is used purely for the side effect of the line numbering with the -n option here, so the main thing is really to use a regular expression which matches all the input lines. As such, .* is not very elegant -- ^ would work without scanning every line, and $ trivially matches every line as well. Since you know the input lines are not empty, thus contain at least one character, the simple regular expression . would work perfectly, too.
However, as the end goal is to perform line numbering, a better solution is to use a dedicated tool for this purpose.
... | sort | nl -ba -s '. '
The -ba option specifies to number all lines (the default is to only add a line number to non-empty lines; we know there are no empty lines, so it's not strictly necessary here, but it's good to know) and the -s option specifies the separator string to put after the number.
A possible minor complication is that the line number format is whitespace-padded, so in the end, this solution may not work for you if you specifically want unpadded numbers. (But a sed postprocessor to fix that up is a lot simpler than the postprocessor for grep you have now -- just sed 's/^ *//' will remove leading whitespace).
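Put together, the sort and numbering stage could look like the following sketch. The function name number_lines and the sample lines are made up for illustration.

```shell
# number_lines: sort stdin, then prefix each line with "N. ".
#   -ba      number every line, including empty ones
#   -s '. '  separator printed between the number and the line
# nl right-aligns the number in a padded column, so sed strips the padding.
number_lines() {
    sort | nl -ba -s '. ' | sed 's/^ *//'
}

printf 'Petar mi12045\nAna mi13007\n' | number_lines
# 1. Ana mi13007
# 2. Petar mi12045
```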
... As an aside, the ugly cat | grep | sed pipeline can be abbreviated to just
sed -n 's%[ \t]*<td>\([A-Z a-z]*\)</td><td>\([0-9]\{1,3\}\)/[0-9]\{2\}\([0-9]\{2\}\)</td>.*%\1 mi\3\2 %p' nastava.html
The cat was never necessary in the first place, and the sed script can easily be refactored to only print when a substitution was performed (your grep regular expression was not exactly equivalent to the one you have in the sed script but I assume that was the intent). Also, using a different separator avoids having to backslash the slashes.
... And of course, if nastava.html is your own web page, the whole process is umop apisdn. You should have the students' results in a machine-readable form, and generate a web page from that, rather than the other way around.
grep needs a regular expression to match. You can't run grep with no expression at all. If you want to number all the lines, just specify an expression that matches anything. I'd probably use ^ instead of .*.

cut command in bash terminating on quotation marks

So I am trying to read in a file that has a bunch of lines with an email address and then a nickname in them. I am trying to extract this nickname, which is surrounded by parentheses, like below
email@somewhere.com (Tom)
so my thought was just to use cut to get at the word Tom, but this is foiled when I end up with something like the following
email2@somewhereElse.com ("Bob")
Because Bob has quotes around it, the cut command fails as follows
cut: <file>: Illegal byte sequence
Does anyone know of a better way of doing this? or a way to solve this problem?
Reset your locale to C (raw uninterpreted byte sequence) to avoid Illegal byte sequence errors.
locale charmap
LC_ALL=C cut ... | LC_ALL=C sort ...
I think that
grep -o '(.*)' emailFile
should do it. "Go through all lines in the file. Look for a sequence that starts with open parens, then any characters until close parens. Echo the bit that matches the string to stdout."
This preserves the quotes around the nickname... as well as the brackets. If you don't want those, you can strip them:
grep -o '(.*)' emailFile | sed 's/[(")]//g'
("replace any of the characters between square brackets with nothing, everywhere")
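Combining this with the LC_ALL=C advice from the other answer gives something like the sketch below. The function name nicknames is made up, the pattern is narrowed to ([^)]*) so that it stops at the first closing parenthesis, and tr is swapped in for the sed step since only single characters are being deleted.

```shell
# nicknames: read "address (nickname)" lines on stdin, print the nicknames.
# LC_ALL=C makes both tools treat the input as raw bytes, which avoids
# "Illegal byte sequence" errors on input that is not valid in the locale.
nicknames() {
    LC_ALL=C grep -o '([^)]*)' | LC_ALL=C tr -d '()"'
}

printf '%s\n' 'email@somewhere.com (Tom)' 'email2@somewhereElse.com ("Bob")' | nicknames
# Tom
# Bob
```

Only the ASCII double quote is removed here; any other quote characters in the input are passed through unchanged.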
perl -lne '$_=~/[^\(]*\(([^)]*)\)/g;print $1'

Convert string to hexadecimal on command line

I'm trying to convert "Hello" to 48 65 6c 6c 6f in hexadecimal as efficiently as possible using the command line.
I've tried looking at printf and google, but I can't get anywhere.
Any help greatly appreciated.
Many thanks in advance,
echo -n "Hello" | od -A n -t x1
Explanation:
The echo program will provide the string to the next command.
The -n flag tells echo to not generate a new line at the end of the "Hello".
The od program is the "octal dump" program. (We will be providing a flag to tell it to dump it in hexadecimal instead of octal.)
The -A n flag is short for --address-radix=n, with n being short for "none". Without this part, the command would output an ugly numerical address prefix on the left side. This is useful for large dumps, but for a short string it is unnecessary.
The -t x1 flag is short for --format=x1, with the x being short for "hexadecimal" and the 1 meaning 1 byte.
If you want to do this and remove the spaces you need:
echo -n "Hello" | od -A n -t x1 | sed 's/ *//g'
The first two commands in the pipeline are well explained by @TMS in his answer, as edited by @James. The last command differs from @TMS's comment in that it is both correct and has been tested. The explanation is:
sed is a stream editor.
s is the substitute command.
/ opens a regular expression - any character may be used. / is
conventional, but inconvenient for processing, say, XML or path names.
/ or the alternate character you chose, closes the regular expression and
opens the substitution string.
In / */ the * matches any sequence of the previous character (in this
case, a space).
/ or the alternate character you chose, closes the substitution string.
In this case, the substitution string // is empty, i.e. the match is
deleted.
g is the option to do this substitution globally on each line instead
of just once for each line.
The quotes keep the command parser from getting confused - the whole
sequence is passed to sed as the first option, namely, a sed script.
@TMS's brain child (sed 's/^ *//') only strips spaces from the beginning of each line (^ matches the beginning of the line - 'pattern space' in sed-speak).
If you additionally want to remove newlines, the easiest way is to append
| tr -d '\n'
to the command pipes. It functions as follows:
| feeds the previously processed stream to this command's standard input.
tr is the translate command.
-d specifies deleting the match characters.
Quotes list your match characters - in this case just newline (\n).
Translate only matches single characters, not sequences.
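Since tr can delete several different characters at once, the space removal and the newline removal can be done in a single step. A minimal sketch; the function name to_hex is made up, printf '%s' stands in for echo -n because the handling of -n differs between echo implementations, and od's -v flag is added so that repeated output lines are not collapsed for longer inputs.

```shell
# to_hex STRING: print the bytes of STRING as one unbroken run of hex digits.
to_hex() {
    printf '%s' "$1" | od -A n -t x1 -v | tr -d ' \n'
    echo    # tr removed every newline, so end the output line explicitly
}

to_hex "Hello"   # 48656c6c6f
```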
sed is uniquely awkward when dealing with newlines. This is because sed is one of the oldest unix commands - it was created before people really knew what they were doing. Pervasive legacy software keeps it from being fixed. I know this because I was born before unix was born.
The historical origin of the problem was the idea that a newline was a line separator, not part of the line. It was therefore stripped by line processing utilities and reinserted by output utilities. The trouble is, this makes assumptions about the structure of user data and imposes unnatural restrictions in many settings. sed's inability to easily remove newlines is one of the most common examples of that malformed ideology causing grief.
It is possible to remove newlines with sed - it is just that all solutions I know about make sed process the whole file at once, which chokes for very large files, defeating the purpose of a stream editor. Any solution that retains line processing, if it is possible, would be an unreadable rat's nest of multiple pipes.
If you insist on using sed try:
sed -z 's/\n//g'
-z tells sed to use nulls as line separators.
Internally, a string in C is terminated with a null. The -z option is also a result of legacy, provided as a convenience for C programmers who might like to use a temporary file filled with C-strings and uncluttered by newlines. They can then easily read and process one string at a time. Again, the early assumptions about use cases impose artificial restrictions on user data.
If you omit the g option, this command removes only the first newline. With the -z option sed interprets the entire file as one line (unless there are stray nulls embedded in the file), terminated by a null and so this also chokes on large files.
You might think
sed 's/^/\x00/' | sed -z 's/\n//' | sed 's/\x00//'
might work. The first command puts a null at the front of each line on a line by line basis, resulting in \n\x00 ending every line. The second command removes one newline from each line, now delimited by nulls - there will be only one newline by virtue of the first command. All that is left are the spurious nulls. So far so good. The broken idea here is that the pipe will feed the last command on a line by line basis, since that is how the stream was built. Actually, the last command, as written, will only remove one null since now the entire file has no newlines and is therefore one line.
Simple pipe implementation uses an intermediate temporary file and all input is processed and fed to the file. The next command may be running in another thread, concurrently reading that file, but it just sees the stream as a whole (albeit incomplete) and has no awareness of the chunk boundaries feeding the file. Even if the pipe is a memory buffer, the next command sees the stream as a whole. The defect is inextricably baked into sed.
To make this approach work, you need a g option on the last command, so again, it chokes on large files.
The bottom line is this: don't use sed to process newlines.
echo hello | hexdump -v -e '/1 "%02X "'
Playing around with this further,
A working solution is to remove the "*". It is unnecessary both for the original requirement of simply removing spaces and for substituting an actual character, if that is desired, as follows:
echo -n "Hello" | od -A n -t x1 | sed 's/ /%/g'
%48%65%6c%6c%6f
So, I consider this as an improvement answering the original Q since the statement now does exactly what is required, not just apparently.
Combining the answers from TMS and i-always-rtfm-and-stfw, the following works under Windows using gnu-utils versions of the programs 'od', 'sed', and 'tr':
echo "Hello"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
or in a CMD file as:
@echo "%1"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
A limitation on my solution is it will remove all double quotes (").
"tr -d '\42'" removes quote marks that the Windows 'echo' will include.
"tr -d '\r'" removes the carriage return, which Windows includes as well as '\n'.
The pipe (|) character must follow immediately after the string or the Windows echo will add that space after the string.
There is no '-n' switch to the Windows echo command.

Resources