How to find files with the same name part in a directory using the diff command? - linux

I have two directories with files in them. Directory A contains a list of photos with numbered endings (e.g. janet1.jpg, laura2.jpg) and directory B has the same files except with different numbered endings (e.g. janet41.jpg, laura33.jpg). How do I find the files that do not have a corresponding file between directories A and B, ignoring the numbered endings? For example, there is a rachael3 in directory A but no rachael\d in directory B. I think there's a way to do this with the diff command in bash, but I don't see an obvious way to do it.

I can't see a way to use diff for this directly. It will probably be easier to run a sums tool (md5, sha1, etc.) over both directories, sort both outputs on the first (sum) column, and then diff/compare those output files.
Alternatively, something like findimagedupes (which isn't as simple a comparison as diff or a sums check) might be a simpler (and possibly more useful) solution.
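A minimal sketch of that checksum idea, assuming the directories are A and B and the renamed files really are byte-identical:
md5sum A/*.jpg B/*.jpg | sort > sums.txt
# checksums that occur only once belong to files with no match in the other directory
awk '{print $1}' sums.txt | uniq -u | grep -Ff - sums.txt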

It sounds like you already know that the files themselves are the same when they exist, and that there is only one of each kind per directory.
So to diff the directory contents on that basis, you only need the relevant part of each file name ("laura", "janet").
This can be done by simply grepping the appropriate parts from the output of ls, like this:
ls dir1/ | egrep -o '^[a-zA-Z]+'
Then to compare, let's say dir1 and dir2, you can use:
diff <(ls dir1/ | egrep -o '^[a-zA-Z]+') <(ls dir2/ | egrep -o '^[a-zA-Z]+')
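To show only the name stems that are missing from one side, a hedged variant using comm (which wants sorted input) could be:
comm -3 <(ls dir1/ | egrep -o '^[a-zA-Z]+' | sort -u) <(ls dir2/ | egrep -o '^[a-zA-Z]+' | sort -u)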

Assuming the files are simply renamed and otherwise identical, a simple solution to find the missing ones is to use md5sum (or sha or somesuch) and uniq:
#!/bin/bash
md5sum A/*.jpg B/*.jpg >index
awk '{print $1}' <index | sort >sums   # keep only the checksums
# list unique files (missing from one directory)
uniq -u sums | while read s; do
    grep "$s" index | sed 's/^[0-9a-f]\{32\}  //'   # strip the hash, keep dir/file
done
This fails in the case where a folder contains several copies of the same file renamed (such that the hash matches multiple files in one folder), but that is easily fixed:
#!/bin/bash
md5sum A/*.jpg B/*.jpg > index
sed 's/\/.*//' <index | sort >sums   # keep "checksum  dir", drop /file
# list unique files (missing from one directory)
uniq sums | awk '{print $1}' |
uniq -u | while read s; do
    grep "$s" index | sed 's/^[0-9a-f]\{32\}  //'
done

Related

Bash Local vs remote directory comparison

I'm trying to compare a local vs a remote directory and identify files which are either not present on the remote directory or different by checksum.
The goal is for the script to return a list of files to iterate through. So far I have the following, but it's not the best.
rsync -avnc /path/to/files remoteuser@remoteserver:/path/to/files/ | grep -v "sending incremental file list" | grep -v "bytes received" | grep -v "total size is" | grep -v "./"
I've just used piped grep -v calls to remove the bits I don't care about. Is there a better way to compare a local and a remote directory over SSH? It seems like there should be. The important constraint is that I have to compare directories across two separate machines.
comm -3 <(ls -l /path/to/files | awk '{print $5"\t"$9}' | sort) <(ssh remoteuser@remoteserver ls -l /path/to/files | awk '{print $5"\t"$9}' | sort)
$5 is the size and $9 is the filename; comm -3 then prints the entries that exist on only one side: local-only in the first column, remote-only in the second.
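If only the names missing locally are wanted, comm -13 keeps just the lines unique to the second (remote) listing; a sketch built on the same size/name fields, still assuming file names without spaces:
comm -13 <(ls -l /path/to/files | awk '{print $5"\t"$9}' | sort) <(ssh remoteuser@remoteserver ls -l /path/to/files | awk '{print $5"\t"$9}' | sort) | awk '{print $2}'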
I would do so using a matching pair of find calls and a call to comm.
# comm -3 produces two-column output, skipping lines in common.
comm -3 <(find "$LOCALDIR" | sort) <(ssh remote@host find "$REMOTEDIR" | sort)
If you write your local and remote output to temporary files, you can easily print a list of missing files on either system; with a little cleverness in your find commands, you could likely compare file checksums between the two systems.
Note that this solution uses line-based text comparison and thus is not immune to bizarre filenames. You may need to investigate a more-clever solution (probably involving find ... -print0) if you need to handle filenames with newlines or other special characters.
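A sketch of that checksum idea, assuming md5sum is available on both machines and that running find relative to each directory makes the paths comparable; any file whose content or relative path differs will show up on one side or the other:
comm -3 <(cd "$LOCALDIR" && find . -type f -exec md5sum {} + | sort) <(ssh remote@host "cd $REMOTEDIR && find . -type f -exec md5sum {} + | sort")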

Find specific string in subdirectories and order top directories by modification date

I have a directory structure containing some files. I'm trying to find the names of the top directories that contain a file with a specific string in it.
I've got this:
grep -r abcdefg . | grep commit_id | sed -r 's/\.\/(.+)\/.*/\1/';
Which returns something like:
topDir1
topDir2
topDir3
I would like to be able to take this output and somehow feed it into this command:
ls -t | grep -e topDir1 -e topDir2 -e topDir3
which would return the output filtered by the first command and ordered by modification date.
I'm hoping for a one liner. Or maybe there is a better way of doing it?
This should work as long as none of the directory names contain whitespace or wildcard characters (dirname does not read from standard input, so the top-level directory names are extracted with sed instead and deduplicated with sort -u):
ls -td $(grep -r abcdefg . | grep commit_id | sed -r 's/\.\/([^/]+)\/.*/\1/' | sort -u)
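If the directory names may contain whitespace, a hedged alternative (GNU xargs) passes them to ls as newline-delimited arguments instead of relying on word splitting:
grep -r abcdefg . | grep commit_id | sed -r 's/\.\/([^/]+)\/.*/\1/' | sort -u | xargs -d '\n' ls -td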

how to compare output of two ls in linux

So here is the task which I can't solve. I have a directory with .h files and a directory with .i files, which have the same names as the .h files. I want, just by typing a command, to get all the .h files that have no matching .i file. It's not a hard problem, I can do it in some programming language, but I'm just curious what it would look like in the shell :). To be more specific, here is the algorithm:
get file names without extensions from ls *.h
get file names without extensions from ls *.i
compare them
print all names from 1 that do not appear in 2
Good luck!
diff \
<(ls dir.with.h | sed 's/\.h$//') \
<(ls dir.with.i | sed 's/\.i$//') \
| grep '^<' \
| cut -c3-
diff <(ls dir.with.h | sed 's/\.h$//') <(ls dir.with.i | sed 's/\.i$//') executes ls on the two directories, cuts off the extensions, and compares the two lists. Then grep '^<' finds the files that are only in the first listing, and cut -c3- cuts off the "< " characters that diff inserted.
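Essentially the same comparison can be written in one step with comm -23, which prints only the lines unique to the first sorted input (a small variation, not part of the answer above):
comm -23 <(ls dir.with.h | sed 's/\.h$//' | sort) <(ls dir.with.i | sed 's/\.i$//' | sort)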
ls ./dir_h/*.h | sed -r -n 's:.*dir_h/([^.]*).h$:dir_i/\1.i:p' | xargs ls 2>&1 | \
grep "No such file or directory" | awk '{print $4}' | sed -n -r 's:dir_i/([^:]*).*:dir_h/\1:p'
ls -1 dir1/*.hh dir2/*.ii | awk -F"/" '{print $NF}' | awk -F"." '{a[$1]++;b[$0]++}END{for(i in a)if(a[i]==1 && b[i".hh"]) print i}'
explanation:
ls -1 dir1/*.hh dir2/*.ii
above will list all the *.hh and *.ii files in both directories.
awk -F"/" '{print $NF}'
above will just print the file name excluding the complete path of the file.
awk -F"." '{a[$1]++;b[$0]}END{for(i in a)if(a[i]==1 && b[i".hh"]) print i}'
above will create two associative arrays: a, keyed by the file name without its extension, and b, keyed by the full file name.
If both the .hh and the .ii file exist, the count in a will be 2; if only one of them exists, the count will be 1. So we want the entries whose count is 1 and which correspond to a header file (.hh).
That check uses the associative array b, and it is done in the END block.
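A quick way to see that logic in isolation, feeding the awk step a few hypothetical names directly:
printf '%s\n' janet.hh janet.ii laura.hh | awk -F"." '{a[$1]++;b[$0]++}END{for(i in a)if(a[i]==1 && b[i".hh"]) print i}'
# prints: laura (the only .hh without a matching .ii)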
Assuming bash is your shell:
for file in dir_with_h/*.h; do
    name=${file%.h}           # trim trailing ".h" file extension
    name=${name#dir_with_h/}  # trim leading folder name
    if [ ! -e "dir_with_i/${name}.i" ]; then
        echo "${name}"
    fi
done
Undoubtedly this can be ported to virtually all other shells. I find this less cryptic than some other approaches (although this is surely my problem), but it is a little wordy. As such, a shell script might help recall it.

Shell script - How would I compare the same files of different dates

I have hundreds of files in a directory, and the files are named with dates as shown below.
How would I compare the same files of different dates?
ex :
/test/
xyz-my_S1logfile.Aug.25.gz
bhd-my_S1logfile.Aug.30.gz
ddddf-my_S2logfie.Aug.25.gz
zsed-my_S2logfie.Aug.30.gz
Compare the size of xyz-my_S1logfile.Aug.25.gz and bhd-my_S1logfile.Aug.30.gz
ddddf-my_S2logfie.Aug.25.gz and zsed-my_S2logfie.Aug.30.gz
.....
Unless I misunderstand your question, you want to find files with duplicate content within a directory. The standard way to do that is to generate a strong hash for the contents of each file. E.g. for SHA256 you can use the sha256sum tool:
sha256sum /my/dir/* > sha256sums.txt
or better yet:
find /my/dir -type f -print0 | xargs -r0 sha256sum > sha256sums.txt
Considering that no collisions have been found for any variant of SHA-2 yet, you can be reasonably confident that any files with the same hash are identical. You can then use sort and uniq to find the duplicate hashes with an occurrence count for each:
cat sha256sums.txt | sort | cut -b -64 | uniq -cd | sort -nr
You can then grep your sha256sums.txt file for each duplicate hash for the corresponding list of files.
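A sketch of that last step, assuming the 64-character hash sits at the start of each line of sha256sums.txt:
sort sha256sums.txt | cut -b -64 | uniq -d | while read -r h; do
    grep "^$h" sha256sums.txt
    echo    # blank line between duplicate groups
done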
Or, if you want an automated tool, you could try FsLint, which supports finding duplicate files.

Linux: cat matching files in date order?

I have a few files in a directory with names similar to
_system1.log
_system2.log
_system3.log
other.log
but they are not created in that order.
Is there a simple, non-hardcoded, way to cat the files starting with the underscore in date order?
Quick 'n' dirty:
cat `ls -t _system*.log`
Safer:
ls -1t _system*.log | xargs -d'\n' cat
Use ls:
ls -1t | xargs cat
ls -1 | xargs cat
You can also restrict which files are concatenated and store the result in a single output file, ordered by modification time (newest first). The following command concatenates the files whose names contain the string 'xyz' and writes them all to outputfile:
cat $(ls -t | grep xyz) > outputfile
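If any of the matching file names contain spaces, a hedged safer variant of the same idea (GNU xargs) is:
ls -t | grep xyz | xargs -d '\n' cat > outputfile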
