Why do seemingly empty files and strings produce md5sums?

Consider the following:
% md5sum /dev/null
d41d8cd98f00b204e9800998ecf8427e /dev/null
% touch empty; md5sum empty
d41d8cd98f00b204e9800998ecf8427e empty
% echo '' | md5sum
68b329da9893e34099c7d8ad5cb9c940 -
% perl -e 'print chr(0)' | md5sum
93b885adfe0da089cdf634904fd59f71 -
% md5sum ''
md5sum: : No such file or directory
First of all, I'm surprised by the output of all these commands. If anything, I would expect the sum to be the same for all of them.

The md5sum of "nothing" (a zero-length stream of characters) is d41d8cd98f00b204e9800998ecf8427e, which you're seeing in your first two examples.
The third and fourth examples are processing a single character. In the "echo" case, it's a newline, i.e.
$ echo -ne '\n' | md5sum
68b329da9893e34099c7d8ad5cb9c940 -
In the perl example, it's a single byte with value 0x00, i.e.
$ echo -ne '\x00' | md5sum
93b885adfe0da089cdf634904fd59f71 -
You can reproduce the empty checksum using "echo" as follows:
$ echo -n '' | md5sum
d41d8cd98f00b204e9800998ecf8427e -
...and using Perl as follows:
$ perl -e 'print ""' | md5sum
d41d8cd98f00b204e9800998ecf8427e -
In all four cases, you should expect the same output when checksumming the same data, but different data should produce a wildly different checksum, even if only a single character differs (that's the whole point).
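You can see this with the two single-byte streams from above: the inputs differ in only one byte (0x0a versus 0x00), yet the digests have nothing in common:
$ echo -ne '\n' | md5sum
68b329da9893e34099c7d8ad5cb9c940 -
$ echo -ne '\x00' | md5sum
93b885adfe0da089cdf634904fd59f71 -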

Why do seemingly empty files and strings produce md5sums?
Because the "sum" in md5sum is somewhat misleading. It's not like, say, a CRC32 checksum, which is zero for an empty file.
MD5 is a message digest algorithm. You can imagine it as a box that produces a fixed-length, random-looking value (the hash) depending on its internal state, and you change the internal state by feeding in data.
That internal state is predefined, such that the box yields a random-looking hash value even before any data is fed in. For MD5, that value happens to be d41d8cd98f00b204e9800998ecf8427e.

No need for surprise. The first two produce truly empty inputs to md5sum. The echo produces a newline (echo -n '' should produce empty output; I don't have a Linux machine here to check). The perl produces a single zero byte (not to be confused with C, where a zero byte marks the end of a string). The last command looks for a file whose name is the empty string.

Related

Iterate over grep output in multi-line chunks

I can't figure out how to iterate over grep output match by match when using the -A or -B flags (which return multiple lines for a single match). I'm not opposed to using awk for this, but it will be used to check multiple large files, so the shorter the execution time, the better.
I figured I could achieve this by using grep's --group-separator flag and setting IFS to the same character, but that doesn't seem to work.
$ grep --group-separator=# -B2 'locked' thisisatest | while IFS=# read -r match; do
echo "a"
echo "${match}"
done
a
this is a test file asdflksdklsdklsdkl.txt
a
this is meaningless
a
this is locked
a
a
this is a test file basdflksdklsdklsdkl.txt
a
this is meaningless
a
this is locked
$ cat thisisatest
this is a test file asdflksdklsdklsdkl.txt
this is meaningless
this is locked
j
j
j
j
j
this is a test file basdflksdklsdklsdkl.txt
this is meaningless
this is locked
I expect "a" to be printed before and after the entirety of a match such as:
a
this is a test file asdflksdklsdklsdkl.txt
this is meaningless
this is locked
a
Changing $IFS controls how read splits a line into fields. You're only having it fill out one field at a time ($match), so the field delimiter doesn't matter. Change the line delimiter instead: use read -d.
grep ... | while read -d# -r match; do
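Applied to your example, a minimal sketch (the || [[ -n $match ]] guard handles the last group, which has no trailing separator for read -d to find):
grep --group-separator=# -B2 'locked' thisisatest | while read -d# -r match || [[ -n $match ]]; do
echo "a"
echo "${match}"
done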

Check file's whole raw text content exists as part of another file

big1.txt:
a
b
c
d
e
big2.txt:
f
c
g
h
i
b
small.txt:
b
c
Within a bash script, how can I tell that the whole ordered content of small.txt exists in another file?
Example:
??? small.txt big1.txt should return true
??? small.txt big2.txt should return false
If big1.txt and big2.txt are not too big (i.e. they can be loaded in memory), the following test may be sufficient.
# to store file content into variables
big1=$(< big1.txt)
big2=$(< big2.txt)
small=$(< small.txt)
# to run from test case
big1=$'a\nb\nc\nd\ne\n'
big2=$'f\nc\ng\nh\ni\nb\n'
small=$'b\nc\n'
if [[ ${big1} = *${small}* ]]; then echo "big1"; fi
if [[ ${big2} = *${small}* ]]; then echo "big2"; fi
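A minimal sketch wrapping this into a script with the calling convention from the question (contains.sh is a made-up name; the quotes around $small keep its contents from being treated as a glob pattern):
#!/bin/bash
# contains.sh small.txt big.txt - exit 0 if big's content contains small's content
small=$(< "$1")
big=$(< "$2")
[[ $big == *"$small"* ]]
Then ./contains.sh small.txt big1.txt && echo true prints true, while the same call with big2.txt prints nothing.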
Sometimes the way to discover that two complicated things are 'equal' is to use a cheap test that is always true when they are equal and rarely true when they are not. Candidates that pass this heuristic test are then checked more carefully, so the full equality test can be expensive and yet is not triggered on every compare.
What I would do in this circumstance is take all the files and sort their lines. (You might want to suppress blank lines if you are looking for matching text, and strip trailing blanks from lines, but that's your choice.) It is probably also useful to delete duplicate lines.
Now compare each file to every longer file to see if it is a prefix. (A file can't be a prefix of a shorter one, so we eliminate half the compares just based on sizes.) If sorted file A is a prefix of sorted file B, then you can run a more complicated test to see if the real file A is embedded in file B (which I suspect will be true with high probability if the sorted files pass the prefix test).
Having had this idea, we can now optimize it. Instead of storing lines of text, we take each file and hash each line, giving a file of hash codes. Sort these, then follow the rest of the procedure.
Next trick: make the hash codes 8 or 16 bits in size, so they fit into a character of your favorite programming language. Now the prefix-compare test can consist of collecting the character-sized hash codes per file and doing a string compare of shorter files against longer ones. At this point we've moved the problem from reading the disk to comparing efficiently in memory; we probably can't speed it up much beyond that, because disk reads are very expensive compared to in-memory computations.
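A rough bash sketch of the cheap-filter idea, using sorted lines directly rather than hash codes (comm -23 prints lines that appear only in the first sorted input, so empty output means every line of small.txt also occurs in big1.txt):
if [ -z "$(comm -23 <(sort small.txt) <(sort big1.txt))" ]; then
# passed the cheap test; now run the expensive ordered check
big=$(< big1.txt) small=$(< small.txt)
[[ $big == *"$small"* ]] && echo "big1 contains small"
fi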
$ diff small big1.txt | grep -q '^<'
$ echo $?
1
$ diff small big2.txt | grep -q '^<'
$ echo $?
0
$ ! (diff small big1.txt | grep -q '^<')
$ echo $?
0
$ ! (diff small big2.txt | grep -q '^<')
$ echo $?
1
$ if diff small big1.txt | grep -q '^<'; then echo "does not exist"; else echo "does exist"; fi
does exist
$ if diff small big2.txt | grep -q '^<'; then echo "does not exist"; else echo "does exist"; fi
does not exist
You can also check with Perl; -0777 makes each <> read slurp a whole file, so the one-liner exits with status 0 exactly when the content of small.txt is found inside big.txt:
if perl -0777 -e '$n = <>; $h = <>; exit(index($h,$n)<0)' small.txt big.txt
then echo small.txt is found in big.txt
fi

How to loop an executable command in the terminal in Linux?

Let me first describe my situation. I am working on a Linux platform and have a collection of .bmp files whose names increment the picture number, from filename0022.bmp up to filename0680.bmp, for a total of 658 pictures. I want to run each of these pictures through a .exe file that operates on the picture, then writes the result to a file specified by the user. It also takes some threshold arguments: lower and upper. So the typical call for the executable is:
./filter inputfile outputfile lower upper
Is there a way that I can loop this call over all the files just from the terminal or by creating some kind of bash script? My problem is similar to this: Execute a command over multiple files with a batch file but this time I am working in a Linux command line terminal.
You may be interested in looking into bash scripting.
You can execute commands in a for loop directly from the shell.
A simple loop can generate the numbers you specifically mentioned. For example, from the shell:
user#machine $ for i in {22..680} ; do
> echo "filename${i}.bmp"
> done
This will give you a list from filename22.bmp to filename680.bmp. That handles iterating over the range you mentioned, but it doesn't zero-pad the numbers. For that you can use printf. The printf syntax is printf format argument. We can use the $i variable from our previous loop as the argument and apply the %0Wd format, where W is the width and the leading 0 tells printf to pad with zeros. Example:
user#machine $ for i in {22..680} ; do
> echo "filename$(printf '%04d' $i).bmp"
> done
In the above, $() is command substitution: the enclosed command is executed and its output is substituted in place, as opposed to using a predefined value.
This should now give you the filenames you had specified. We can take that and apply it to the actual application:
user#machine $ for i in {22..680} ; do
> ./filter "filename$(printf '%04d' $i).bmp" "filtered_filename$(printf '%04d' $i).bmp" lower upper
> done
This can be rewritten as one line:
user#machine $ for i in {22..680} ; do ./filter "filename$(printf '%04d' $i).bmp" "filtered_filename$(printf '%04d' $i).bmp" lower upper ; done
One thing to note from the question: .exe files are generally compiled in the COFF format, whereas Linux expects an ELF-format executable.
here is a simple example:
for i in {1..100}; do echo "Hello Linux Terminal"; done
to append to a file (>> appends; a single > would overwrite):
for i in {1..100}; do echo "Hello Linux Terminal" >> file.txt; done
You can try something like this...
#! /bin/bash
# note: write the lower bound as 22, not 022 - a leading zero makes bash arithmetic treat the number as octal
for ((a=22; a <= 680 ; a++))
do
./filter "$(printf 'filename%04d.bmp' "$a")" outputfile lower upper
done
You can leverage xargs for iterating:
ls | xargs -I{} ./filter {} {}_out lower upper
Note:
{} corresponds to one line of output from the pipe; here it's the input file name.
Output files would be named with the suffix '_out'.
You can test this as-is in your shell:
for i in *; do
echo "$i" | tr '[:lower:]' '[:upper:]'
done
If you have a specific path, replace * with your path plus a glob. Example:
for i in /home/me/*.exe; do ...
See http://mywiki.wooledge.org/glob
This will prepend filtered_ to the name of the output images, like filtered_filename0055.bmp:
for i in *; do
./filter "$i" "filtered_$i" lower upper
done

Counting number of characters in a file through shell script

I want to count the number of characters in a file, from the start up to the EOF character. Can anyone tell me how to do this through a shell script?
This will do it for counting bytes in file:
wc -c filename
If you want only the count without the filename being repeated in the output:
wc -c < filename
This will count characters in multibyte files (Unicode etc.):
wc -m filename
(as shown in Sébastien's answer).
#!/bin/sh
wc -m "$1" | awk '{print $1}'
wc -m counts the number of characters; the awk command prints the number of characters only, omitting the filename.
wc -c would give you the number of bytes (which can differ from the number of characters, since depending on the encoding a character may be encoded on several bytes).
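For example, assuming a UTF-8 locale, ä is one character but two bytes:
$ printf 'ä' | wc -m
1
$ printf 'ä' | wc -c
2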
To get the exact character count of a string, use printf, as opposed to echo, cat, or running wc -c directly on a file: echo appends a newline character, and a text file typically ends with one, so the newline is included in the count. A file with the text 'hello' will report 6 if you use echo etc., but with printf it returns the exact 5, because there's no newline to count.
How to use printf for counting characters within strings:
$ printf '6chars' | wc -m
6
To turn this into a script you can run on a text file to count characters, save the following in a file called print-character-amount.sh:
#!/bin/bash
characters=$(cat "$1")
printf '%s' "$characters" | wc -m
Run chmod +x on print-character-amount.sh, place the file in your PATH (i.e. /usr/bin/ or any directory exported as PATH in your .bashrc file), then run the script on a text file:
print-character-amount.sh file-to-count-characters-of.txt
awk '{t+=length($0)}END{print t}' file3
awk only
awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)c++}END{print "total chars:"c}' file
shell only
var=$(<file)
echo ${#var}
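Note that $(<file) strips trailing newlines, so this can report fewer characters than wc -m:
$ printf 'hello\n' > f
$ var=$(<f); echo ${#var}
5
$ wc -m < f
6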
Ruby(1.9+)
ruby -0777 -ne 'print $_.size' file
The following script is tested and gives exactly the expected results:
#!/bin/bash
echo "Enter the file name"
read file
echo "enter the word to be found"
read word
count=0
for i in `cat $file`
do
if [ $i == $word ]
then
count=`expr $count + 1`
fi
done
echo "The number of words are $count"
I would have thought it better to use stat to find the size of a file, since the filesystem knows it already, rather than causing the whole file to be read with awk or wc - especially if it is a multi-GB file, or one that may be non-resident in the file system on an HSM.
stat -c%s file
Yes, I concede it doesn't account for multi-byte characters, but would add that the OP has never clarified whether that is/was an issue.
Credits to user.py et al.
echo "ää" > /tmp/your_file.txt
cat /tmp/your_file.txt | wc -m
results in 3.
In my example the expected result is 2 (the letter ä twice). However, echo (or vi) adds a line break \n to the end of the output (or file). So two ä characters and one Linux line break \n are counted: three in total.
Working with pipes | is not the shortest variant, but this way I have to know fewer wc parameters by heart. In addition, cat is bullet-proof in my experience.
Tested on Ubuntu 18.04.1 LTS (Bionic Beaver).

How to create a hex dump of file containing only the hex characters without spaces in bash?

How do I create an unmodified hex dump of a binary file in Linux using bash? The od and hexdump commands both insert spaces in the dump and this is not ideal.
Is there a way to simply write a long string with all the hex characters, minus spaces or newlines in the output?
xxd -p file
Or if you want it all on a single line:
xxd -p file | tr -d '\n'
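For example, on two ASCII bytes:
$ printf 'AB' | xxd -p
4142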
Format strings can make hexdump behave exactly as you want it to (no whitespace at all, byte by byte):
hexdump -ve '1/1 "%.2x"'
1/1 means "each format is applied once and takes one byte", and "%.2x" is the actual format string, like in printf. In this case: 2-character hexadecimal number, leading zeros if shorter.
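For example (the trailing echo just adds a newline so the dump doesn't run into the next prompt):
$ printf 'AB' | hexdump -ve '1/1 "%.2x"'; echo
4142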
It seems to depend on the details of the version of od. On OSX, use this:
od -t x1 -An file |tr -d '\n '
(That's print as type hex bytes, with no address. And whitespace deleted afterwards, of course.)
Perl one-liner:
perl -e 'local $/; print unpack "H*", <>' file
The other answers are preferable, but for a pure Bash solution, I've modified the script in my answer here to be able to output a continuous stream of hex characters representing the contents of a file. (Its normal mode is to emulate hexdump -C.)
I think this is the most widely supported version (requiring only POSIX defined tr and od behavior):
cat "$file" | od -v -t x1 -A n | tr -d ' \n'
This uses od to print each byte as hex, without addresses and without skipping repeated bytes, and tr to delete all spaces and linefeeds in the output. Note that not even a trailing linefeed is emitted. (The cat is intentional, to allow multicore processing: cat can wait on the filesystem while od is still processing a previously read part. Single-core users may want to replace it with < "$file" od ... to save starting one additional process.)
tldr;
$ od -t x1 -A n -v <empty.zip | tr -dc '[:xdigit:]' && echo
504b0506000000000000000000000000000000000000
$
Explanation:
Use the od tool to print single hexadecimal bytes (-t x1) --- without address offsets (-A n) and without eliding repeated "groups" (-v) --- from empty.zip, which has been redirected to standard input. Pipe that to tr which deletes (-d) the complement (-c) of the hexadecimal character set ('[:xdigit:]'). You can optionally print a trailing newline (echo) as I've done here to separate the output from the next shell prompt.
References:
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html
This code produces a "pure" hex dump string, and it runs faster than all the other examples given.
It has been tested on 1GB files filled with binary zeros, and all linefeeds.
It is not data content dependent and reads 1MB records instead of lines.
perl -pe 'BEGIN{$/=\1e6} $_=unpack "H*"'
Dozens of timing tests show that for 1GB files, these other methods below are slower.
All tests were run writing output to a file which was then verified by checksum.
Three 1GB input files were tested: all bytes, all binary zeros, and all LFs.
hexdump -ve '1/1 "%.2x"' # ~10x slower
od -v -t x1 -An | tr -d "\n " # ~15x slower
xxd -p | tr -d \\n # ~3x slower
perl -e 'local $/; print unpack "H*", <>' # ~1.5x slower
- this also slurps the whole file into memory
To reverse the process:
perl -pe 'BEGIN{$/=\1e6} $_=pack "H*",$_'
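A quick round-trip check of the pair (file.bin, file.hex, and restored.bin are placeholder names): dump, restore, and compare byte for byte:
$ perl -pe 'BEGIN{$/=\1e6} $_=unpack "H*"' file.bin > file.hex
$ perl -pe 'BEGIN{$/=\1e6} $_=pack "H*",$_' file.hex > restored.bin
$ cmp file.bin restored.bin && echo identical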
You can use Python for this purpose:
python -c "print(open('file.bin','rb').read().hex())"
...where file.bin is your filename.
Explanation:
Open file.bin in rb (read binary) mode.
Read contents (returned as bytes object).
Use bytes method .hex(), which returns hex dump without spaces or new lines.
Print output.
