Shell script - How would I compare the same files of different dat - linux

I am having hundreds of files in a directory, and files are name with date as given below.
How would I compare the same files of different date.
ex :
/test/
xyz-my_S1logfile.Aug.25.gz
bhd-my_S1logfile.Aug.30.gz
ddddf-my_S2logfie.Aug.25.gz
zsed-my_S2logfie.Aug.30.gz
Compare the size of xyz-my_S1logfile.Aug.25.gz and bhd-my_S1logfile.Aug.30.gz
ddddf-my_S2logfie.Aug.25.gz and zsed-my_S2logfie.Aug.30.gz
.....

Unless I misunderstand your question, you want to find files with duplicate content within a directory. The standard way to do that is to generate a strong hash for the contents of each file. E.g. for SHA256 you can use the sha256sum tool:
sha256sum /my/dir/* > sha256sums.txt
or better yet:
find /my/dir -type f -print0 | xargs -r0 sha256sum > sha256sums.txt
Considering that no collisions have been found for any variant of SHA-2 yet, you can be reasonably confident that any files with the same hash are identical. You can then use sort and uniq to find the duplicate hashes with an occurrence count for each:
cat sha256sums.txt | sort | cut -b -32 | uniq -cd | sort -nr
You can then grep your sha256sums.txt file for each duplicate hash for the corresponding list of files.
Or, if you want an automated tool, you could try FsLint, which supports finding duplicate files.

Related

Can find push the filenames of the found files into the pipe?

I would like to do a find in some dir, and do a awk on the files in this direcory, and then replace the original files by each result.
find dir | xargs cat | awk ... | mv ... > filename
So I need the filename (of each of the files found by find) in the last command. How can I do that?
I would use a loop, like:
for filename in `find . -name "*test_file*" -print0 | xargs -0`
do
# some processing, then
echo "what you like" > "$filename"
done
EDIT: as noted in the comments, the benefits of -print0 | xargs -0 are lost because of the for loop. And filenames containing a white space are still not handled correctly.
The following while loop would not handle unusual filenames neither (good to know it, though it was not in the question), but filenames with a standard white space at least, so it works better, indeed:
find . -name "*test*file*" -print > files_list
while IFS= read -r filename
do
# some process
echo "what you like" > "$filename"
done < files_list
You could do something like this (but I wouldn't recommend it at all).
find dir -print0 |
xargs -0 -n 2 awk -v OFS='\0' '<process the input and write to temporary file>
END {print "temporaryfile", FILENAME}' |
xargs -0 -n 2 mv
This passes the files to awk directly two at a time (which avoids the problem with your original where cat will get hundreds (perhaps more) files as arguments all at once and spit all their content at awk via standard input at once and thus lose their individual contents and filenames entirely).
It then has awk write the processed output to a temporary file and then outputs the temporary filename and the original filename where xargs picks them up (again two at a time) and runs mv on the pairs of temporary file/original file names.
As I said at the beginning however this is a terrible way to do this.
If you have a new enough version of GNU awk (version 4.1.0 or newer) then you could just use the -i (in-place) argument to awk and use (I believe):
find dir | xargs awk -i '......'
Without that I would use a while loop of the form in Bash FAQ 001 to read the find output line-by-line and operate on it in the loop.

How to find files with same name part in directory using the diff command?

I have two directories with files in them. Directory A contains a list of photos with numbered endings (e.g. janet1.jpg laura2.jpg) and directory B has the same files except with different numbered endings (e.g. janet41.jpg laura33.jpg). How do I find the files that do not have a corresponding file from directory A and B while ignoring the numbered endings? For example there is a rachael3 in directory A but no rachael\d in directory B. I think there's a way to do with the diff command in bash but I do not see an obvious way to do it.
I can't see a way to use diff for this directly. It will probably be easier to use a sums tool (md5, sha1, etc.) on both directories and then sort both files based on the first (sum) column and diff/compare those output files.
Alternatively, something like findimagedupes (which isn't as simple a comparison as diff or a sums check) might be a simpler (and possibly more useful) solution.
It seems you know that your files are the same, if they exist and you are sure, there is only one of a kind per directory.
So to diff the contents of the directory according to this, you need to get only the relevant parts of the file name ("laura", "janet").
This could be done by simple grepping the appropriate parts from the output of ls like this:
ls dir1/ | egrep -o '^[a-A]+'
Then to compare, let's say dir1 and dir2, you can use:
diff <(ls dir1/ | egrep -o '^[a-A]+') <(ls dir2/ | egrep -o '^[a-A]+')
Assuming the files are simply renamed and otherwise identical, a simple solution to find the missing ones is to use md5sum (or sha or somesuch) and uniq:
#!/bin/bash
md5sum A/*.jpg B/*.jpg >index
awk '{print $1}' <index | sort >sums # delete dir/file
# list unique files (missing from one directory)
uniq -u sums | while read s; do
grep "$s" index | sed 's/^[a-z0-9]\{32\} //'
done
This fails in the case where a folder contains several copies of the same file renamed (such that the hash matches multiple files in one folder), but that is easily fixed:
#!/bin/bash
md5sum A/*.jpg B/*.jpg > index
sed 's/\/.*//' <index | sort >sums # just delete /file
# list unique files (missing from one directory)
uniq sums | awk '{print $1}' |\
uniq -u | while read s junk; do
grep "$s" index | sed 's/^[a-z0-9]\{32\} //'
done

Bash Local vs remote directory comparison

I'm trying to compare a local vs a remote directory and identify files which are either not present on the remote directory or different by checksum.
The goal is for the script to return a list of files to iterate through. So far I have the following, but it's not the best.
rsync -avnc /path/to/files remoteuser#remoteserver:/path/to/files/ | grep -v "sending incremental file list" | grep -v "bytes received" | grep -v "total size is" | grep -v "./"
I've just used piped grep -v calls to remove the bits I don't care about. Is there a better way to compare a local and remote directory using SSH? It seems like their should be. The important constraint is that I have to compare directories across two separate machines.
comm -3 <(ls -l /path/to/files | awk '{print $5"\t"$9}' | sort) <(ssh remoteuser#remoteserver ls -l /path/to/files | awk '{print $5"\t"$9}' | sort)
$5 is size
$9 is filename
then, print files which exists only in remote server
I would do so using a matching pair of find calls and a call to comm.
# comm -3 produces two-column output, skipping lines in common.
comm -3 <(find $LOCALDIR | sort) <(ssh remote#host find $REMOTEDIR | sort)
If you write your local and remote output to temporary files, you can easily print a list of missing files on either system; with a little cleverness in your find commands, you could likely compare file checksums between the two systems.
Note that this solution uses line-based text comparison and thus is not immune to bizarre filenames. You may need to investigate a more-clever solution (probably involving find ... -print0) if you need to handle filenames with newlines or other special characters.

Find lines common to several files

I'm trying to determine which header declares a specific function. I've used grep to find instances of the function's use; now, I want to find which header is included by all the files. I'm aware of the comm utility; however, it can only compare two sorted files. Is there a Unix utility that can find the common lines between an arbitrary number of unsorted files, or must I write my own?
cat *.c | sort | uniq -c | grep -e '^ *COUNT #include'
where COUNT is the number of files passed to cat. In playing around, I used this variant to see what files I #include at least 10 times:
cat *.c | sort | uniq -c | grep -e '^ *[0-9][0-9]\+ #include'

Linux: cat matching files in date order?

I have a few files in a directory with names similar to
_system1.log
_system2.log
_system3.log
other.log
but they are not created in that order.
Is there a simple, non-hardcoded, way to cat the files starting with the underscore in date order?
Quick 'n' dirty:
cat `ls -t _system*.log`
Safer:
ls -1t _system*.log | xargs -d'\n' cat
Use ls:
ls -1t | xargs cat
ls -1 | xargs cat
You can concatenate and also store them in a single file according to their time of creation and also you can specify the files which you want to concatenate. Here, I find it very useful. The following command will concatenate the files which are arranged according to their time of creaction and have common string 'xyz' in their file name and store all of them in outputfile.
cat $(ls -t |grep xyz)>outputfile

Resources