Fastest way to tell if two files have the same contents in Unix/Linux?

I have a shell script in which I need to check whether two files contain the same data. I do this for a lot of files, and in my script the diff command seems to be the performance bottleneck.
Here's the line:
diff -q $dst $new > /dev/null
if ($status) then ...
Could there be a faster way to compare the files, maybe a custom algorithm instead of the default diff?

I believe cmp will stop at the first byte difference:
cmp --silent $old $new || echo "files are different"

Like @Alex Howansky, I have used 'cmp --silent' for this. But I need both a positive and a negative response, so I use:
cmp --silent file1 file2 && echo '### SUCCESS: Files Are Identical! ###' || echo '### WARNING: Files Are Different! ###'
I can then run this in the terminal or over ssh to check files against a constant file.

To quickly and safely compare any two files:
if cmp --silent -- "$FILE1" "$FILE2"; then
echo "files contents are identical"
else
echo "files differ"
fi
It's readable, efficient, and works for any file names including "` $()

Because I suck and don't have enough reputation points I can't add this tidbit in as a comment.
But, if you are going to use the cmp command (and don't need/want to be verbose) you can just grab the exit status. Per the cmp man page:
If a FILE is '-' or missing, read standard input. Exit status is 0
if inputs are the same, 1 if different, 2 if trouble.
So, you could do something like:
STATUS="$(cmp --silent "$FILE1" "$FILE2"; echo $?)" # capture cmp's exit status ($?) as a string
if [[ $STATUS -ne 0 ]]; then # non-zero status: the files differ, or cmp hit an error
echo "do something with $FILE1" # placeholder: run a command on $FILE1
else
echo "do something else" # placeholder: the files are identical
fi
EDIT: Thanks for the comments everyone! I updated the test syntax here. However, I would suggest you use Vasili's answer if you are looking for something similar to this answer in readability, style, and syntax.
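The three exit statuses quoted from the man page above can also be handled separately, so that "trouble" (for example a missing file) is not silently treated as "different". The following is only a sketch; the function name compare_files and the messages it prints are made up for illustration.

```shell
# compare_files A B: print a verdict and return cmp's own exit status
# (0 = same, 1 = different, 2 = trouble). Hypothetical helper name.
compare_files() {
    rc=0
    cmp -s -- "$1" "$2" || rc=$?
    case $rc in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "trouble comparing $1 and $2" >&2 ;;
    esac
    return "$rc"
}
```

It can be called as compare_files "$FILE1" "$FILE2", and the caller can still test the return status with if or ||.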

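You can compare files using a checksum algorithm such as SHA-256. Store the checksum of the old file, then check the new file against the stored digest:
sha256sum oldFile > oldFile.sha256
echo "$(cut -d' ' -f1 oldFile.sha256)  newFile" | sha256sum --check
If the files are identical, the result will be:
newFile: OK
If the files differ, the result will be: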
newFile: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
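When many files have to be checked against one constant file, it may be simpler to compute the reference digest once and compare digests as plain strings. This is a rough sketch that assumes sha256sum is installed; the helper names file_hash and matches_hash are invented for this example.

```shell
# file_hash FILE: print only the SHA-256 digest of FILE (no file name).
file_hash() {
    sha256sum < "$1" | cut -d' ' -f1
}

# matches_hash FILE DIGEST: succeed if FILE's digest equals DIGEST.
matches_hash() {
    [ "$(file_hash "$1")" = "$2" ]
}
```

For example, ref=$(file_hash oldFile) can be computed once, and then matches_hash newFile "$ref" can be run for each candidate. Every candidate still has to be read in full, so this only saves re-reading the reference file.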

For files that do not differ, any method requires reading both files entirely, even if the read happened in the past.
There is no alternative. Creating hashes or checksums at some point requires reading the whole file, and big files take time.
File metadata retrieval is much faster than reading a large file.
So, is there any file metadata you can use to establish that the files are different?
File size, for example? Or even the result of the file command, which reads only a small portion of the file?
File size example code fragment:
ls -l "$1" "$2" |
awk 'NR==1{a=$5} NR==2{b=$5}
END{val=(a==b)?0 :1; exit( val) }'
[ $? -eq 0 ] && echo 'same' || echo 'different'
If the files are the same size then you are stuck with full file reads.
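The size check and the full comparison can be combined so that the expensive read only happens when the sizes match. This is a minimal sketch, assuming wc -c can report the size of a regular file cheaply (GNU wc is believed to do so without reading the data); the function name same_contents is hypothetical.

```shell
# same_contents A B: return 0 if identical, 1 if different, 2 on trouble.
# Sizes are compared first; cmp only runs when they are equal.
same_contents() {
    size1=$(wc -c < "$1") || return 2
    size2=$(wc -c < "$2") || return 2
    [ "$size1" -eq "$size2" ] || return 1
    cmp -s -- "$1" "$2"
}
```

Usage would look like: same_contents "$1" "$2" && echo 'same' || echo 'different'.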

You can also try the cksum command:
chk1=$(cksum < "$file1" | awk '{print $1}') # CRC checksum of the first file
chk2=$(cksum < "$file2" | awk '{print $1}') # CRC checksum of the second file
if [ "$chk1" -eq "$chk2" ]
then
echo "Files are identical"
else
echo "Files are not identical"
fi
The cksum command prints a CRC checksum followed by the byte count of the file. See 'man cksum'.

Doing some testing with a Raspberry Pi 3B+ (I'm using an overlay file system, and need to sync periodically), I ran a comparison of my own for diff -q and cmp -s; note that this is a log from inside /dev/shm, so disk access speeds are a non-issue:
[root@mypi shm]# dd if=/dev/urandom of=test.file bs=1M count=100 ; time diff -q test.file test.copy && echo diff true || echo diff false ; time cmp -s test.file test.copy && echo cmp true || echo cmp false ; cp -a test.file test.copy ; time diff -q test.file test.copy && echo diff true || echo diff false; time cmp -s test.file test.copy && echo cmp true || echo cmp false
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 6.2564 s, 16.8 MB/s
Files test.file and test.copy differ
real 0m0.008s
user 0m0.008s
sys 0m0.000s
diff false
real 0m0.009s
user 0m0.007s
sys 0m0.001s
cmp false
cp: overwrite 'test.copy'? y
real 0m0.966s
user 0m0.447s
sys 0m0.518s
diff true
real 0m0.785s
user 0m0.211s
sys 0m0.573s
cmp true
[root@mypi shm]# pico /root/rwbscripts/utils/squish.sh
I ran it a couple of times. cmp -s consistently had slightly shorter times on the test box I was using. So if you want to use cmp -s to do things between two files....
identical (){
echo "$1" and "$2" are the same.
echo This is a function, you can put whatever you want in here.
}
different () {
echo "$1" and "$2" are different.
echo This is a function, you can put whatever you want in here, too.
}
cmp -s "$FILEA" "$FILEB" && identical "$FILEA" "$FILEB" || different "$FILEA" "$FILEB"

If you are looking for a more customizable diff for this, then git diff can be used.
if git diff --no-index --quiet -- old.txt new.txt; then
echo "files contents are identical"
else
echo "files differ"
fi
--quiet
Disable all output of the program. Implies --exit-code.
--exit-code
Make the program exit with codes similar to diff(1). That is, it exits with 1 if there were differences and 0 means no differences.
Also, there are various algorithms and settings to choose from:
--diff-algorithm={patience|minimal|histogram|myers}
Choose a diff algorithm. The variants are as follows:
default, myers: The basic greedy diff algorithm. Currently, this is the default.
minimal: Spend extra time to make sure the smallest possible diff is produced.
patience: Use "patience diff" algorithm when generating patches.
histogram: This algorithm extends the patience algorithm to "support low-occurrence common elements".

Related

Using inotifywait to process two files in parallel

I am using:
inotifywait -m -q -e close_write --format %f . | while IFS= read -r file; do
cp -p "$file" /path/to/other/directory
done
to monitor a folder for file completion, then copy each finished file to another folder.
Files are made in pairs but at separate times, i.e. File1_001.txt is made at 3pm and File1_002.txt is made at 9pm. I want to monitor for the completion of BOTH files, then launch a script:
script.sh File1_001.txt File1_002.txt
So I need to have another inotifywait command or a different utility, that can also identify that both files are present and completed, then start the script.
Does anyone know how to solve this problem?
I found a Linux box with inotifywait installed on it, so now I understand what it does and how it works. :)
Is this what you need?
#!/bin/bash
if [ "$1" = "-v" ]; then
Verbose=true
shift
else
Verbose=false
fi
file1="$1"
file2="$2"
$Verbose && printf 'Waiting for %s and %s.\n' "$file1" "$file2"
got1=false
got2=false
while IFS= read -r thisfile; do
$Verbose && printf '>> %s' "$thisfile"
case "$thisfile" in
"$file1") got1=true; $Verbose && printf "... it's a match!" ;;
"$file2") got2=true; $Verbose && printf "... it's a match!" ;;
esac
esac
$Verbose && printf '\n'
if $got1 && $got2; then
$Verbose && printf 'Saw both files.\n'
break
fi
done < <(inotifywait -m -q -e close_write --format %f .)
This runs a single inotifywait but parses its output in a loop that exits when both files on the command line ($1 and $2) are seen to have been updated.
Note that if one file is closed and then later is reopened while the second file is closed, this script obviously will not detect the open file. But that may not be a concern in your use case.
Note that there are many ways of building a solution -- I've shown you only one.
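The matching logic can also be separated from inotifywait, which makes it possible to try it out with plain text input. The sketch below is one possible arrangement, not the only one; the function name wait_for_both is made up, and it simply reads file names from standard input, one per line.

```shell
# wait_for_both NAME1 NAME2: read file names from standard input and
# return 0 as soon as both names have been seen, 1 if input ends first.
wait_for_both() {
    got1=false
    got2=false
    while IFS= read -r name; do
        if [ "$name" = "$1" ]; then got1=true; fi
        if [ "$name" = "$2" ]; then got2=true; fi
        if $got1 && $got2; then
            return 0
        fi
    done
    return 1
}
```

In bash it could then be wired up as: wait_for_both File1_001.txt File1_002.txt < <(inotifywait -m -q -e close_write --format %f .) && script.sh File1_001.txt File1_002.txt. Process substitution is used rather than a pipe so that the shell does not wait for inotifywait to exit before starting the script.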

How to extract only file name return from diff command?

I am trying to write a bash script to sync two directories, but I am not able to get the file names returned by diff. Every time, the output gets split into an array of words.
Here is my code :
#!/bin/bash
DIRS1=`diff -r /opt/lampp/htdocs/scripts/dev/ /opt/lampp/htdocs/scripts/www/ `
for DIR in $DIRS1
do
echo $DIR
done
And if I run this script I get output something like this:
Only
in
/opt/lampp/htdocs/scripts/www/:
file1
diff
-r
"/opt/lampp/htdocs/scripts/dev/File
1.txt"
"/opt/lampp/htdocs/scripts/www/File
1.txt"
0a1
>
sa
das
Only
in
/opt/lampp/htdocs/scripts/www/:
File
1.txt~
Only
in
/opt/lampp/htdocs/scripts/www/:
file
2
-
second
Actually, I just want the names of the files where there is a difference, so I can take a particular action, either copy or delete.
Thanks
I don't think diff produces output which can be parsed easily for your purposes. It's possible to solve your problem by iterating over the files in the two directories and running diff on them, using the return value from diff instead (and throwing the diff output away).
The code to do this is a bit long, but here it is:
DIR1=./one # set as required
DIR2=./two # set as required
# Process any files in $DIR1 only, or in both $DIR1 and $DIR2
find "$DIR1" -type f -print0 | while IFS= read -r -d '' file1; do
relative_path=${file1#${DIR1}/};
file2="$DIR2/$relative_path"
if [[ ! -f "$file2" ]]; then
echo "'$relative_path' in '$DIR1' only"
# Do more stuff here
elif diff -q "$file1" "$file2" >/dev/null; then
echo "'$relative_path' same in '$DIR1' and '$DIR2'"
# Do more stuff here
else
echo "'$relative_path' different between '$DIR1' and '$DIR2'"
# Do more stuff here
fi
done
# Process files in $DIR2 only
find "$DIR2" -type f -print0 | while IFS= read -r -d '' file2; do
relative_path=${file2#${DIR2}/};
file1="$DIR1/$relative_path"
if [[ ! -f "$file1" ]]; then
echo "'$relative_path' in '$DIR2' only"
# Do more stuff here
fi
done
This code leverages some tricks to safely handle files which contain spaces, which would be very difficult to get working by parsing diff output.
Of course this doesn't do anything regarding files which have the same contents but different names or are located in different directories.
I tested by populating two test directories as follows:
echo "dir one only" > "$DIR1/dir one only.txt"
echo "dir two only" > "$DIR2/dir two only.txt"
echo "in both, same" > $DIR1/"in both, same.txt"
echo "in both, same" > $DIR2/"in both, same.txt"
echo "in both, and different" > $DIR1/"in both, different.txt"
echo "in both, but different" > $DIR2/"in both, different.txt"
My output was:
'dir one only.txt' in './one' only
'in both, different.txt' different between './one' and './two'
'in both, same.txt' same in './one' and './two'
'dir two only.txt' in './two' only
Use -q flag and avoid the for loop:
diff -rq /opt/lampp/htdocs/scripts/dev/ /opt/lampp/htdocs/scripts/www/
If you only want the files that differ:
diff -rq /opt/lampp/htdocs/scripts/dev/ /opt/lampp/htdocs/scripts/www/ | grep -Po '(?<=^Files ).*(?= and .* differ$)' | while IFS= read -r file; do
echo "$file"
done
-q --brief
Output only whether files differ.
But you should definitely check out rsync: http://linux.die.net/man/1/rsync

Bash script - I want to check if an XLS is empty. If it is, I don't want to do anything. If it is not, I want to do something

I have a bash script that includes an if-then-fi statement. The code block should execute only when the XLS is not empty. Currently I'm evaluating this with the following:
FILESIZE=$(wc -c < "$FILENAME")
It seems that the default file size generated is 4096 bytes if the file is empty. So...
if [ $FILESIZE -gt "4096" ]; then
do something
fi
However, my boss isn't a huge fan of hard-coded numbers. Is there an alternative way to see whether an XLS has data?
Thanks!
if [ -r "$FILENAME" ] # If there is a readable file "$FILENAME"
then
if [ -s "$FILENAME" ] # If file "$FILENAME" has a size greater than zero bytes
then
do something
fi
fi
You could use the xls2csv command; if the result is 0, the file is empty.
xls2csv file.xls | wc -l
This command is usually in the "catdoc" package.

Bash: Create a file if it does not exist, otherwise check to see if it is writeable

I have a bash program that will write to an output file. This file may or may not exist, but the script must check permissions and fail early. I can't find an elegant way to make this happen. Here's what I have tried.
set +e
touch $file
set -e
if [ $? -ne 0 ]; then exit;fi
I keep set -e on for this script so it fails if there is ever an error on any line. Is there an easier way to do the above script?
Why complicate things?
file=exists_and_writeable
if [ ! -e "$file" ] ; then
touch "$file"
fi
if [ ! -w "$file" ] ; then
echo cannot write to $file
exit 1
fi
Or, more concisely,
( [ -e "$file" ] || touch "$file" ) && [ ! -w "$file" ] && echo cannot write to $file && exit 1
Rather than check $? on a different line, check the return value immediately like this:
touch file || exit
As long as your umask doesn't restrict the write bit from being set, you can just rely on the return value of touch
You can use -w to check if a file is writable (search for it in the bash man page).
if [[ ! -w $file ]]; then exit; fi
Why must the script fail early? By separating the writable test and the file open() you introduce a race condition. Instead, why not try to open (truncate/append) the file for writing, and deal with the error if it occurs? Something like:
if ! echo foo > output.txt; then
echo "Couldn't write to output.txt" >&2
exit 1
fi
As others mention, the "noclobber" option might be useful if you want to avoid overwriting existing files.
Open the file for writing. In the shell, this is done with an output redirection. You can redirect the shell's standard output by putting the redirection on the exec built-in with no argument.
set -e
exec >shell.out # exit if shell.out can't be opened
echo "This will appear in shell.out"
Make sure you haven't set the noclobber option (which is useful interactively but often unusable in scripts). Use > if you want to truncate the file if it exists, and >> if you want to append instead.
If you only want to test permissions, you can run : >foo.out to create the file (or truncate it if it exists).
If you only want some commands to write to the file, open it on some other descriptor, then redirect as needed.
set -e
exec 3>foo.out
echo "This will appear on the standard output"
echo >&3 "This will appear in foo.out"
echo "This will appear both on standard output and in foo.out" | tee /dev/fd/3
(/dev/fd is not supported everywhere; it's available at least on Linux, *BSD, Solaris and Cygwin.)
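If the file must not be truncated while testing, the same redirection idea works with >> instead of >. The following is a small sketch under that assumption; the function name can_write is hypothetical. The subshell is there because, in POSIX shells, a failed redirection on the special built-in : may otherwise terminate a non-interactive script.

```shell
# can_write FILE: succeed if FILE can be opened for appending, creating it
# if needed. Existing contents are left untouched.
can_write() {
    ( : >> "$1" ) 2>/dev/null
}
```

A script could then start with something like: can_write "$file" || { echo "cannot write to $file" >&2; exit 1; }. This still leaves a window between the check and the later write, as noted above.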

Parsing result of Diff in Shell Script

I want to compare two files and see if they are the same or not in my shell script, my way is:
diff_output=`diff ${dest_file} ${source_file}`
if [ some_other_condition -o ${diff_output} -o some_other_condition2 ]
then
....
fi
Basically, if they are the same ${diff_output} should contain nothing and the above test would evaluate to true.
But when I run my script, it says
[: too many arguments
On the if [....] line.
Any ideas?
Do you care about what the actual differences are, or just whether the files are different? If it's the latter you don't need to parse the output; you can check the exit code instead.
if diff -q "$source_file" "$dest_file" > /dev/null; then
: # files are the same
else
: # files are different
fi
Or use cmp which is more efficient:
if cmp -s "$source_file" "$dest_file"; then
: # files are the same
else
: # files are different
fi
There's an option provided precisely for doing this: -q (or --brief). It tells diff to report only whether the files differ, so you can rely on the exit code. That way you can do this:
if diff -q "$dest_file" "$source_file" > /dev/null; then
: # files are identical
else
: # files differ
fi
or if you want to swap the two clauses:
if ! diff -q "$dest_file" "$source_file" > /dev/null; then
: # files differ
else
: # files are identical
fi
If you really wanted to do it your way (i.e. you need the output) you should do this:
if [ -n "$diff_output" -o ... ]; then
...
fi
-n means "test if the following string is non-empty". You also have to surround the variable with quotes, so that if it's empty, the test still has a string there. Without quotes, non-empty diff output is split into many words, which is what produces "too many arguments", and empty output makes your test evaluate to some_other_condition -o -o some_other_condition2, which as you can see isn't going to work either.
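If the diff output itself is not needed, the comparison can be chained with the other conditions using || on exit statuses instead of -o inside [ ]. This is only a sketch: the function name should_copy and the optional third "force" argument are invented stand-ins for some_other_condition.

```shell
# should_copy DEST SOURCE [FORCE]: succeed if FORCE is "yes" or if the
# two files differ (a missing file also counts as "differ" here).
should_copy() {
    [ "${3:-no}" = "yes" ] || ! cmp -s -- "$1" "$2"
}
```

It would be used as: if should_copy "$dest_file" "$source_file"; then ... fi.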
diff compares files line by line and produces output that tools like patch can process. If you just want to check whether the files are different, you should use cmp:
cmp --quiet "$source_file" "$dest_file" || echo Different
diff "$FILE" "$FILE2"
if [ $? = 0 ]; then
echo "TWO FILES ARE SAME"
else
echo "TWO FILES ARE SOMEWHAT DIFFERENT"
fi
Check files for difference in bash
source_file=abc.txt
dest_file=xyz.txt
if [[ -z $(sdiff -s $dest_file $source_file) ]]; then
echo "Files are same"
else
echo "Files are different"
fi
Code tested on RHEL/CentOS Linux (6.X and 7.X)
