How to create a hex dump of a file containing only the hex characters, without spaces, in bash?

How do I create an unmodified hex dump of a binary file in Linux using bash? The od and hexdump commands both insert spaces in the dump and this is not ideal.
Is there a way to simply write a long string with all the hex characters, minus spaces or newlines in the output?

xxd -p file
Or if you want it all on a single line:
xxd -p file | tr -d '\n'
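To go back from the plain hex dump to the original binary (a sketch; file.hex is a hypothetical name for the dump produced above):
xxd -r -p file.hex > restored.bin
xxd -r -p reverses the plain (-p) dump and tolerates whitespace and newlines in its input, so either the multi-line or the single-line form can be fed back in.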

Format strings can make hexdump behave exactly as you want it to (no whitespace at all, byte by byte):
hexdump -ve '1/1 "%.2x"'
1/1 means "each format is applied once and takes one byte", and "%.2x" is the actual format string, like in printf. In this case: 2-character hexadecimal number, leading zeros if shorter.
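For example (a quick illustration, not from the original answer; the trailing echo only adds a newline so the output is not glued to the next prompt):
$ printf 'ABC' | hexdump -ve '1/1 "%.2x"'; echo
414243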

It seems to depend on the details of the version of od. On OSX, use this:
od -t x1 -An file |tr -d '\n '
(That's print as type hex bytes, with no address. And whitespace deleted afterwards, of course.)

Perl one-liner:
perl -e 'local $/; print unpack "H*", <>' file

The other answers are preferable, but for a pure Bash solution, I've modified the script in my answer here to be able to output a continuous stream of hex characters representing the contents of a file. (Its normal mode is to emulate hexdump -C.)

I think this is the most widely supported version (requiring only POSIX defined tr and od behavior):
cat "$file" | od -v -t x1 -A n | tr -d ' \n'
This uses od to print each byte as hex, without addresses and without skipping repeated bytes, and tr to delete all spaces and linefeeds in the output. Note that not even the trailing linefeed is emitted here. (The cat is intentional, to allow multicore processing: cat can wait on the filesystem while od is still processing the previously read part. Single-core users may want to replace it with < "$file" od ... to avoid starting one additional process.)
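As a small illustration (demo.txt is a hypothetical file used only for this example):
$ printf 'Hi\n' > demo.txt
$ cat demo.txt | od -v -t x1 -A n | tr -d ' \n'
48690a
The three bytes are 'H' (48), 'i' (69) and the newline (0a); as noted above, no trailing linefeed follows the dump, so the next shell prompt reappears on the same line.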

tldr;
$ od -t x1 -A n -v <empty.zip | tr -dc '[:xdigit:]' && echo
504b0506000000000000000000000000000000000000
$
Explanation:
Use the od tool to print single hexadecimal bytes (-t x1) --- without address offsets (-A n) and without eliding repeated "groups" (-v) --- from empty.zip, which has been redirected to standard input. Pipe that to tr which deletes (-d) the complement (-c) of the hexadecimal character set ('[:xdigit:]'). You can optionally print a trailing newline (echo) as I've done here to separate the output from the next shell prompt.
References:
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html

This code produces a "pure" hex dump string and it runs faster than the all the
other examples given.
It has been tested on 1GB files filled with binary zeros, and all linefeeds.
It is not data content dependent and reads 1MB records instead of lines.
perl -pe 'BEGIN{$/=\1e6} $_=unpack "H*"'
Dozens of timing tests show that for 1GB files, these other methods below are slower.
All tests were run writing output to a file which was then verified by checksum.
Three 1GB input files were tested: all bytes, all binary zeros, and all LFs.
hexdump -ve '1/1 "%.2x"' # ~10x slower
od -v -t x1 -An | tr -d "\n " # ~15x slower
xxd -p | tr -d \\n # ~3x slower
perl -e 'local $/; print unpack "H*", <>' # ~1.5x slower
- this also slurps the whole file into memory
To reverse the process:
perl -pe 'BEGIN{$/=\1e6} $_=pack "H*",$_'
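A round trip to check that the two one-liners are inverses (a sketch; file.bin and file.hex are hypothetical names):
perl -pe 'BEGIN{$/=\1e6} $_=unpack "H*"' file.bin > file.hex
perl -pe 'BEGIN{$/=\1e6} $_=pack "H*",$_' file.hex > restored.bin
cmp file.bin restored.bin && echo identical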

You can use Python for this purpose:
python -c "print(open('file.bin','rb').read().hex())"
...where file.bin is your filename.
Explanation:
Open file.bin in rb (read binary) mode.
Read contents (returned as bytes object).
Use bytes method .hex(), which returns hex dump without spaces or new lines.
Print output.
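The reverse direction is just as short (a sketch, assuming the hex dump was saved to file.hex; .strip() removes any trailing newline before decoding):
python -c "open('file.bin','wb').write(bytes.fromhex(open('file.hex').read().strip()))"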

Related

bash and awk extract string at specific position in non-utf file

I have a file foo.txt that is encoded with charset ISO-8859-1.
I am doing some field extraction with awk, based on a specific position.
E.g. at each line, extract a string that starts at position 10 with length 5.
That is a simple task; however, the command below has different behaviors on different Linux machines (with different bash/awk versions).
In Machine 1 OK, Machine 2 NOT ok:
cat foo.dat | iconv -f ISO-8859-1 -t UTF-8 | awk '{print substr($0, 10,5)}' > results.utf8
In Machine 1 NOT ok, Machine 2 OK:
cat foo.dat | awk '{print substr($0, 10,5)}' | iconv -f ISO-8859-1 -t UTF-8 > results.utf8
If I run the same command with the same input file, the results are different on each line that contains a "non-utf" char like (a▒c) before the 'cut' position.
No idea where the issue is, linux Kernel, bash or awk version... and specially how to have a common way to extract the desired strings...
No idea where the issue is, linux Kernel, bash or awk version...
The GNU Awk User's Guide - Bytes vs. Characters claims that
The POSIX standard requires that awk function in terms of characters,
not bytes. Thus in gawk, length(), substr(), split(),
match() and the other string functions (...) all work in terms of
characters in the local character set, and not in terms of bytes. (Not
all awk implementations do so, though).
If the above holds true, then the answer to "how to have a common way to extract the desired strings" is to use an awk implementation that is compliant with POSIX (or at least one that respects the above rule and works in terms of characters, not bytes), and to make sure the local character set is set as desired.
One option is to use a language which only has one implementation and where you can turn off UTF-8 (or rather, fail to turn it on).
It's not entirely clear what you expect the output to be, but I'm guessing you want something like this:
perl -lne 'print substr($_, 9, 5)' foo.dat | iconv -f ISO-8859-1 -t UTF-8
Notice how the conversion only happens after the extraction, so you can be sure that each byte is exactly one character.
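Another option, if you specifically want byte-based extraction from the ISO-8859-1 file, is to force a byte-oriented locale for the awk step (a sketch; behaviour still depends on the awk implementation honouring LC_ALL):
LC_ALL=C awk '{print substr($0, 10, 5)}' foo.dat | iconv -f ISO-8859-1 -t UTF-8 > results.utf8
In the C locale every byte counts as one character, so substr() counts bytes on both machines, and the conversion to UTF-8 again happens only after the extraction.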

Find files with non-printing characters (null bytes)

I have got the log of my application with a field that contains strange characters.
I see these characters only when I use less command.
I tried to copy the result of my line of code in a text file and what I see is
CTP_OUT=^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
I'd like to know if there is a way to find these null characters. I have tried with a grep command, but it didn't show anything.
I can hardly believe it: I get to write an answer involving cat!
The characters you are observing are non-printable characters, which are often written in caret notation. The caret notation of a character is a way to visualize non-printable characters. As mentioned in the OP, ^# is the representation of NULL.
If your file has non-printable characters, you can visualize them using cat -vET:
-E, --show-ends: display $ at end of each line
-T, --show-tabs: display TAB characters as ^I
-v, --show-nonprinting: use ^ and M- notation, except for LFD and TAB
source: man cat
I've added the -E and -T flags to it, to make everything non-printable visible.
As grep will not output the non-printable characters themselves in any form, you have to pipe its output to cat to see them. The following examples show all lines containing non-printable characters.
Show all lines with non-printable characters:
$ grep -E '[^[:print:]]' --color=never file | cat -vET
Here, the ERE [^[:print:]] selects all non-printable characters.
Show all lines with NULL:
$ grep -Pa '\x00' --color=never file | cat -vET
Be aware that we need to make use of Perl regular expressions here (-P), as they understand hexadecimal and octal notation.
Various control characters can be written in C language style: \n matches a newline, \t a tab, \r a carriage return, \f a form feed, etc.
More generally, \nnn, where nnn is a string of three octal digits, matches the character whose native code point is nnn. You can easily run into trouble if you don't have exactly three digits. So always use three, or since Perl 5.14, you can use \o{...} to specify any number of octal digits.
Similarly, \xnn, where nn are hexadecimal digits, matches the character whose native ordinal is nn. Again, not using exactly two digits is a recipe for disaster, but you can use \x{...} to specify any number of hex digits.
source: Perl 5 version 26.1 documentation
An example:
$ printf 'foo\012\011\011bar\014\010\012foobar\012\011\000\013\000car\012\011\011\011\012' > test.txt
$ cat test.txt
foo
bar
foobar
car
If we now use grep alone, we get the following:
$ grep -Pa '\x00' --color=never test.txt
car
But piping it to cat allows us to visualize the control characters:
$ grep -Pa '\x00' --color=never test.txt | cat -vET
^I^#^K^#car$
Why --color=never: if your grep is configured with --color=auto or --color=always, it will add extra control characters that the terminal interprets as colours, and these might be confused with the actual content.
$ grep -Pa '\x00' --color=always test.txt | cat -vET
^I^[[01;31m^[[K^#^[[m^[[K^K^[[01;31m^[[K^#^[[m^[[Kcar$
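If you want to find which files contain null bytes rather than which lines, GNU grep can list matching file names directly (a sketch, assuming GNU grep with -P support):
grep -rlP '\x00' /path/to/logs
Here -r searches recursively, -l prints only the names of files that contain at least one match, and -P enables the Perl-style \x00 escape as above.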
sed can do this too.
sed -n '/\x0/ { s/\x0/<NUL>/g; p}' file
-n skips printing any output unless explicitly requested.
/\x0/ selects for only lines with null bytes.
{...} encapsulates multiple commands, so that they can be collectively applied always and only when the /\x0/ has detected a null on the line.
s/\x0/<NUL>/g; substitutes in a new, visible value for the null bytes. You could make it whatever you want - I used <NUL> as something both reasonably obvious and yet unlikely to occur otherwise. You should probably grep the file for it first to be sure the pattern doesn't exist before using it.
p; causes lines that have been edited (because they had a null byte) to show.
This basically makes sed an effective grep for nulls.

Strip bash script from beginning of gzip file

I have a series of files which are comprised of a bash script, at the end of which a gzip file has been concatenated.
I would like a method of stripping off the leading bash, to leave a pure gzip file.
The method I have come up with is to:
Do a hex dump on the file;
Use sed to remove everything before the gzip magic number 1f 8b;
Convert the remaining hex dump back to binary.
i.e.
xxd -c1 -p input | tr "\n" " " | sed 's/^.*1f 8b/1f 8b/' | xxd -r -p > output
This appears to work okay at first glance. However, it would fall apart if the gzip portion of the file happens to contain the byte sequence 1f 8b anywhere other than in the initial header. In those cases it deletes everything before the last occurrence.
Is my initial attempt on the right track, and what can I do to fix it? Or is there a much better way to do this that I have missed?
I would use the sed line range functionality to accomplish this. -n suppresses normal printing, and the range /\x1f\x8b/,$ will match every line after and including the first one with \x1f\x8b in it and print them out.
sed -n '/\x1f\x8b/,$ p'
Alternatively, depending on your tastes, you can add a text marker "### BEGIN GZIP DATA ###" and delete everything before and including it:
sed '1,/### BEGIN GZIP DATA ###/ d'
Perl solution. It sets the record separator to the magic sequence and prints all the records except the first one. The magic sequence must be printed up front, otherwise it would be lost together with the bash script, which forms the first record.
perl -ne 'BEGIN { $/ = "\x1f\x8b"; print $/; } print if $. != 1' input > output.gz
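An alternative that avoids the hex round-trip altogether is to locate the byte offset of the first magic sequence and cut there (a sketch, assuming GNU grep and tail; offsets from grep -b are 0-based while tail -c + is 1-based, hence the +1):
offset=$(grep -abo $'\x1f\x8b' input | head -n1 | cut -d: -f1)
tail -c +$((offset + 1)) input > output.gz
Because only the first occurrence is used, later 1f 8b sequences inside the gzip data are left untouched.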

Convert string to hexadecimal on command line

I'm trying to convert "Hello" to 48 65 6c 6c 6f in hexadecimal as efficiently as possible using the command line.
I've tried looking at printf and google, but I can't get anywhere.
Any help greatly appreciated.
Many thanks in advance,
echo -n "Hello" | od -A n -t x1
Explanation:
The echo program will provide the string to the next command.
The -n flag tells echo to not generate a new line at the end of the "Hello".
The od program is the "octal dump" program. (We will be providing a flag to tell it to dump it in hexadecimal instead of octal.)
The -A n flag is short for --address-radix=n, with n being short for "none". Without this part, the command would output an ugly numerical address prefix on the left side. This is useful for large dumps, but for a short string it is unnecessary.
The -t x1 flag is short for --format=x1, with the x being short for "hexadecimal" and the 1 meaning 1 byte.
If you want to do this and remove the spaces you need:
echo -n "Hello" | od -A n -t x1 | sed 's/ *//g'
The first two commands in the pipeline are well explained by #TMS in his answer, as edited by #James. The last command differs from #TMS's comment in that it is both correct and has been tested.
sed is a stream editor.
s is the substitute command.
/ opens a regular expression - any character may be used. / is
conventional, but inconvenient for processing, say, XML or path names.
/ or the alternate character you chose, closes the regular expression and
opens the substitution string.
In / */ the * matches any sequence of the previous character (in this
case, a space).
/ or the alternate character you chose, closes the substitution string.
In this case, the substitution string // is empty, i.e. the match is
deleted.
g is the option to do this substitution globally on each line instead
of just once for each line.
The quotes keep the command parser from getting confused - the whole
sequence is passed to sed as the first option, namely, a sed script.
#TMS's brainchild (sed 's/^ *//') only strips spaces from the beginning of each line (^ matches the beginning of the line - 'pattern space' in sed-speak).
If you additionally want to remove newlines, the easiest way is to append
| tr -d '\n'
to the command pipes. It functions as follows:
| feeds the previously processed stream to this command's standard input.
tr is the translate command.
-d specifies deleting the match characters.
Quotes list your match characters - in this case just newline (\n).
Translate only matches single characters, not sequences.
sed is uniquely awkward when dealing with newlines. This is because sed is one of the oldest unix commands - it was created before people really knew what they were doing. Pervasive legacy software keeps it from being fixed. I know this because I was born before unix was born.
The historical origin of the problem was the idea that a newline was a line separator, not part of the line. It was therefore stripped by line processing utilities and reinserted by output utilities. The trouble is, this makes assumptions about the structure of user data and imposes unnatural restrictions in many settings. sed's inability to easily remove newlines is one of the most common examples of that malformed ideology causing grief.
It is possible to remove newlines with sed - it is just that all solutions I know about make sed process the whole file at once, which chokes for very large files, defeating the purpose of a stream editor. Any solution that retains line processing, if it is possible, would be an unreadable rat's nest of multiple pipes.
If you insist on using sed try:
sed -z 's/\n//g'
-z tells sed to use nulls as line separators.
Internally, a string in C is terminated with a null. The -z option is also a result of legacy, provided as a convenience for C programmers who might like to use a temporary file filled with C-strings and uncluttered by newlines. They can then easily read and process one string at a time. Again, the early assumptions about use cases impose artificial restrictions on user data.
If you omit the g option, this command removes only the first newline. With the -z option sed interprets the entire file as one line (unless there are stray nulls embedded in the file), terminated by a null and so this also chokes on large files.
You might think
sed 's/^/\x00/' | sed -z 's/\n//' | sed 's/\x00//'
might work. The first command puts a null at the front of each line on a line by line basis, resulting in \n\x00 ending every line. The second command removes one newline from each line, now delimited by nulls - there will be only one newline by virtue of the first command. All that is left are the spurious nulls. So far so good. The broken idea here is that the pipe will feed the last command on a line by line basis, since that is how the stream was built. Actually, the last command, as written, will only remove one null since now the entire file has no newlines and is therefore one line.
Simple pipe implementation uses an intermediate temporary file and all input is processed and fed to the file. The next command may be running in another thread, concurrently reading that file, but it just sees the stream as a whole (albeit incomplete) and has no awareness of the chunk boundaries feeding the file. Even if the pipe is a memory buffer, the next command sees the stream as a whole. The defect is inextricably baked into sed.
To make this approach work, you need a g option on the last command, so again, it chokes on large files.
The bottom line is this: don't use sed to process newlines.
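For this particular task the spaces and newlines can be removed in one tr step without involving sed at all (a sketch built from the od command above):
echo -n "Hello" | od -A n -t x1 | tr -d ' \n'
48656c6c6f
tr deletes every space and newline from the stream in a single pass, so there is no line-by-line processing to fight with.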
echo hello | hexdump -v -e '/1 "%02X "'
Playing around with this further, a working simplification is to remove the "*" from the sed pattern: it is unnecessary both for the original requirement of simply removing spaces and when substituting an actual character for each space, as follows:
echo -n "Hello" | od -A n -t x1 | sed 's/ /%/g'
%48%65%6c%6c%6f
So I consider this an improvement on the original answer, since the statement now does exactly what is required rather than just appearing to.
Combining the answers from TMS and i-always-rtfm-and-stfw, the following works under Windows using gnu-utils versions of the programs 'od', 'sed', and 'tr':
echo "Hello"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
or in a CMD file as:
#echo "%1"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
A limitation on my solution is it will remove all double quotes (").
"tr -d '\42'" removes quote marks that the Windows 'echo' will include.
"tr -d '\r'" removes the carriage return, which Windows includes as well as '\n'.
The pipe (|) character must follow immediately after the string or the Windows echo will add that space after the string.
There is no '-n' switch to the Windows echo command.

Counting number of characters in a file through shell script

I want to check the number of characters in a file, from the start to the EOF character. Can anyone tell me how to do this through a shell script?
This will do it for counting bytes in file:
wc -c filename
If you want only the count without the filename being repeated in the output:
wc -c < filename
This will count characters in multibyte files (Unicode etc.):
wc -m filename
(as shown in Sébastien's answer).
#!/bin/sh
wc -m $1 | awk '{print $1}'
wc -m counts the number of characters; the awk command prints the number of characters only, omitting the filename.
wc -c would give you the number of bytes (which can be different to the number of characters, as depending on the encoding you may have a character encoded on several bytes).
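A quick way to see the difference (assuming a UTF-8 locale, where ä is encoded as two bytes):
$ printf 'ä' | wc -c
2
$ printf 'ä' | wc -m
1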
To get the exact character count of a string, use printf, as opposed to echo, cat, or running wc -c directly on a file, because echo, cat, etc. will also count a newline character, which gives you the number of characters including the newline. So a file with the text 'hello' will report 6 if you use echo etc., but if you use printf it will return the exact 5, because there's no newline element to count.
How to use printf for counting characters within strings:
$ printf '6chars' | wc -m
6
To turn this into a script you can run on a text file to count characters, save the following in a file called print-character-amount.sh:
#!/bin/bash
characters=$(cat "$1")
printf "$characters" | wc -m
Run chmod +x on print-character-amount.sh containing the above text, place the file in your PATH (e.g. /usr/bin/ or any directory exported in PATH in your .bashrc file), and then run the script on a text file like this:
print-character-amount.sh file-to-count-characters-of.txt
awk '{t+=length($0)}END{print t}' file3
awk only
awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)c++}END{print "total chars:"c}' file
shell only
var=$(<file)
echo ${#var}
Ruby(1.9+)
ruby -0777 -ne 'print $_.size' file
The following script is tested and gives exactly the results that are expected:
#!/bin/bash
echo "Enter the file name"
read file
echo "enter the word to be found"
read word
count=0
for i in `cat $file`
do
if [ $i == $word ]
then
count=`expr $count + 1`
fi
done
echo "The number of words are $count"
I would have thought that it would be better to use stat to find the size of a file, since the filesystem knows it already, rather than causing the whole file to have to be read with awk or wc - especially if it is a multi-GB file or one that may be non-resident in the file-system on an HSM.
stat -c%s file
Yes, I concede it doesn't account for multi-byte characters, but would add that the OP has never clarified whether that is/was an issue.
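For reference, the -c%s syntax is GNU stat; on BSD and macOS the equivalent spelling is (to the best of my knowledge):
stat -f%z file
Both ask the filesystem for the size in bytes instead of reading the file's contents.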
Credits to user.py et al.
echo "ää" > /tmp/your_file.txt
cat /tmp/your_file.txt | wc -m
results in 3.
In my example the result is expected to be 2 (twice the letter ä). However, echo (or vi) adds a line break \n to the end of the output (or file). So two ä and one Linux line break \n are counted - three in total.
Working with pipes (|) is not the shortest variant, but this way I have to know fewer wc parameters by heart. In addition, cat is bullet-proof in my experience.
Tested on Ubuntu 18.04.1 LTS (Bionic Beaver).
