Bash Local vs remote directory comparison - linux

I'm trying to compare a local vs a remote directory and identify files which are either not present on the remote directory or different by checksum.
The goal is for the script to return a list of files to iterate through. So far I have the following, but it's not the best.
rsync -avnc /path/to/files remoteuser@remoteserver:/path/to/files/ | grep -v "sending incremental file list" | grep -v "bytes received" | grep -v "total size is" | grep -v "./"
I've just used piped grep -v calls to remove the bits I don't care about. Is there a better way to compare a local and remote directory using SSH? It seems like there should be. The important constraint is that I have to compare directories across two separate machines.

comm -3 <(ls -l /path/to/files | awk '{print $5"\t"$9}' | sort) <(ssh remoteuser@remoteserver ls -l /path/to/files | awk '{print $5"\t"$9}' | sort)
$5 is size
$9 is filename
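comm -3 then prints the entries found on only one side: local-only lines in the first column and remote-only lines (indented by a tab) in the second. Use comm -13 instead to print only the files which exist only on the remote server.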

I would do so using a matching pair of find calls and a call to comm.
# comm -3 produces two-column output, skipping lines in common.
comm -3 <(find "$LOCALDIR" | sort) <(ssh remote@host find "$REMOTEDIR" | sort)
If you write your local and remote output to temporary files, you can easily print a list of missing files on either system; with a little cleverness in your find commands, you could likely compare file checksums between the two systems.
Note that this solution uses line-based text comparison and thus is not immune to bizarre filenames. You may need to investigate a more-clever solution (probably involving find ... -print0) if you need to handle filenames with newlines or other special characters.
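The checksum comparison hinted at above could be sketched roughly as follows. This is only an illustration under assumptions: checksum_list and needs_sync are hypothetical helper names, sha256sum is assumed to exist on both machines, and the host and paths in the comments are placeholders. It is still line-based, so filenames containing newlines or backslashes (which sha256sum escapes) are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of any answer above.

# Print "checksum  ./relative/path" for every regular file under directory $1.
# LC_ALL=C keeps the sort order identical on both machines.
checksum_list() {
    (cd "$1" && find . -type f -exec sha256sum {} + | LC_ALL=C sort)
}

# Given two listings produced by checksum_list, print the paths from the
# first listing that are missing from, or differ in, the second.
needs_sync() {
    LC_ALL=C comm -23 "$1" "$2" | sed 's/^[0-9a-f]\{64\} [ *]//'
}

# Intended use (host and paths are placeholders):
#   checksum_list /path/to/files > local.sums
#   ssh remoteuser@remoteserver \
#       "cd /path/to/files && find . -type f -exec sha256sum {} + | LC_ALL=C sort" > remote.sums
#   needs_sync local.sums remote.sums
```

Because the paths are relative to each directory, the local and remote roots do not need to have the same name.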

Related

Can find push the filenames of the found files into the pipe?

I would like to run find in some directory, run awk on the files in this directory, and then replace each original file with the result.
find dir | xargs cat | awk ... | mv ... > filename
So I need the filename (of each of the files found by find) in the last command. How can I do that?
I would use a loop, like:
for filename in `find . -name "*test_file*" -print0 | xargs -0`
do
# some processing, then
echo "what you like" > "$filename"
done
EDIT: as noted in the comments, the benefits of -print0 | xargs -0 are lost because of the for loop. And filenames containing a white space are still not handled correctly.
The following while loop does not handle unusual filenames either (for example, names containing newlines; good to know, though it was not in the question), but it does at least handle filenames with ordinary spaces, so it works better:
find . -name "*test*file*" -print > files_list
while IFS= read -r filename
do
# some process
echo "what you like" > "$filename"
done < files_list
You could do something like this (but I wouldn't recommend it at all).
find dir -print0 |
xargs -0 -n 2 awk -v OFS='\0' '<process the input and write to temporary file>
END {print "temporaryfile", FILENAME}' |
xargs -0 -n 2 mv
This passes the files to awk directly two at a time (which avoids the problem with your original where cat will get hundreds (perhaps more) files as arguments all at once and spit all their content at awk via standard input at once and thus lose their individual contents and filenames entirely).
It then has awk write the processed output to a temporary file and then outputs the temporary filename and the original filename where xargs picks them up (again two at a time) and runs mv on the pairs of temporary file/original file names.
As I said at the beginning however this is a terrible way to do this.
If you have a new enough version of GNU awk (version 4.1.0 or newer) then you could just load its inplace extension with -i inplace and use (I believe):
find dir -type f | xargs gawk -i inplace '......'
Without that I would use a while loop of the form in Bash FAQ 001 to read the find output line-by-line and operate on it in the loop.
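A loop of that kind might look like the sketch below. It is an illustration, not the FAQ's exact code: rewrite_each is a hypothetical helper name, and it uses find -print0 with read -d '' (a Bash feature) so that filenames with spaces or newlines survive. Each file is processed into a temporary file first and only overwritten if awk succeeds.

```shell
#!/usr/bin/env bash
# rewrite_each DIR AWK_PROGRAM  (hypothetical helper)
# Run the awk program on every regular file under DIR and replace the
# file's contents with awk's output.
rewrite_each() {
    local dir=$1 prog=$2 file tmp
    while IFS= read -r -d '' file; do
        tmp=$(mktemp) || return 1
        if awk "$prog" "$file" > "$tmp"; then
            # Overwrite the contents in place so the original
            # permissions and ownership are kept.
            cat "$tmp" > "$file"
        fi
        rm -f "$tmp"
    done < <(find "$dir" -type f -print0)
}

# Example: swap the first two fields of every file under ./dir
#   rewrite_each dir '{ print $2, $1 }'
```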

Linux: Reverse Sort files in directory and get second file

I am trying to get the second file when the directory listing is sorted in reverse (descending) order, and copy it to my local directory using scp.
Here's what I got:
scp -r uname@host:./backups/dir1/$(ls -r | head -2| tail -1) /tmp/data_sync/dir1/
I still seem to copy all the files when I run this script. What am I missing? TIA.
The $(...) is being interpreted locally. If you want the commands to run on the remote, you'll need to use ssh and have the remote side use scp to copy files to your local system.
Since parsing ls's output has a number of problems, I'll use find to accomplish the same thing as ls, telling it to use NUL between each filename rather than newline. sort sorts that list of filenames, and sed -n 2p prints the second element of the sorted list of filenames. xargs runs the scp command, inserting the filename as the first argument.
ssh uname@host "find ./backups/dir1/ -mindepth 1 -maxdepth 1 -name '[^.]*' -print0 | \
sort -r -z | sed -z -n 2p | \
xargs -0 -I {} scp {} yourlocalhost:/tmp/data_sync/dir1/"
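The find, sort and sed steps can be tried locally before wrapping them in ssh. The sketch below assumes GNU find, sort and sed (for -printf, -z and -zn); second_in_reverse is a hypothetical helper name, and the host and paths in the comments are the placeholders from the question.

```shell
#!/usr/bin/env bash
# second_in_reverse DIR  (hypothetical helper)
# Print the name of the second entry of DIR when the non-hidden entries
# are sorted by name in reverse order.
second_in_reverse() {
    find "$1" -mindepth 1 -maxdepth 1 -name '[!.]*' -printf '%f\0' |
        LC_ALL=C sort -rz |   # NUL-separated reverse sort
        sed -zn 2p |          # keep only the second record
        tr -d '\0'            # drop the trailing NUL before printing
}

# Possible remote use, run from the local machine:
#   name=$(ssh uname@host "$(declare -f second_in_reverse); second_in_reverse ./backups/dir1")
#   scp "uname@host:./backups/dir1/$name" /tmp/data_sync/dir1/
```

Fetching the name over ssh and then running scp locally avoids needing the remote host to connect back to your machine. The declare -f trick assumes the remote login shell is also Bash.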
If I understood your question, your command is fine apart from one detail:
you ran scp -r, which copies directories recursively.
Try without -r:
scp uname@host:./backups/dir1/$(ls -r | head -2 | tail -1) /tmp/data_sync/dir1/
The basic syntax for scp is:
scp username@source:/location/to/file username@destination:/where/to/put
Don't forget that -r recursively copies entire directories. Also note that scp follows symbolic links encountered in the tree traversal.

How to find files with same name part in directory using the diff command?

I have two directories with files in them. Directory A contains a list of photos with numbered endings (e.g. janet1.jpg laura2.jpg) and directory B has the same files except with different numbered endings (e.g. janet41.jpg laura33.jpg). How do I find the files that do not have a corresponding file from directory A and B while ignoring the numbered endings? For example there is a rachael3 in directory A but no rachael\d in directory B. I think there's a way to do with the diff command in bash but I do not see an obvious way to do it.
I can't see a way to use diff for this directly. It will probably be easier to use a sums tool (md5, sha1, etc.) on both directories and then sort both files based on the first (sum) column and diff/compare those output files.
Alternatively, something like findimagedupes (which isn't as simple a comparison as diff or a sums check) might be a simpler (and possibly more useful) solution.
It seems you know that your files are the same, if they exist and you are sure, there is only one of a kind per directory.
So to diff the contents of the directory according to this, you need to get only the relevant parts of the file name ("laura", "janet").
This could be done by simple grepping the appropriate parts from the output of ls like this:
ls dir1/ | grep -oE '^[a-zA-Z]+'
Then to compare, let's say dir1 and dir2, you can use:
diff <(ls dir1/ | grep -oE '^[a-zA-Z]+') <(ls dir2/ | grep -oE '^[a-zA-Z]+')
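The same idea can be sketched without parsing ls, using shell parameter expansion to strip the extension and the numbered ending. The names stems and unmatched are hypothetical helpers, and the sketch assumes the "name part" is everything before the first digit.

```shell
#!/usr/bin/env bash
# stems DIR  (hypothetical helper)
# Print the unique name parts of the files in DIR, e.g. janet41.jpg -> janet.
stems() {
    local f name
    for f in "$1"/*; do
        [ -e "$f" ] || continue           # skip the unmatched-glob case
        name=${f##*/}                     # strip the directory
        name=${name%.*}                   # strip the extension
        printf '%s\n' "${name%%[0-9]*}"   # strip from the first digit onward
    done | LC_ALL=C sort -u
}

# unmatched DIR_A DIR_B  (hypothetical helper)
# Column 1: name parts only in DIR_A; column 2 (tab-indented): only in DIR_B.
unmatched() {
    LC_ALL=C comm -3 <(stems "$1") <(stems "$2")
}

# Example:
#   unmatched A B
```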
Assuming the files are simply renamed and otherwise identical, a simple solution to find the missing ones is to use md5sum (or sha or somesuch) and uniq:
#!/bin/bash
md5sum A/*.jpg B/*.jpg >index
awk '{print $1}' <index | sort >sums # delete dir/file
# list unique files (missing from one directory)
uniq -u sums | while read s; do
grep "$s" index | sed 's/^[a-f0-9]\{32\} [ *]//'
done
This fails in the case where a folder contains several copies of the same file renamed (such that the hash matches multiple files in one folder), but that is easily fixed:
#!/bin/bash
md5sum A/*.jpg B/*.jpg > index
sed 's/\/.*//' <index | sort >sums # just delete /file
# list unique files (missing from one directory)
uniq sums | awk '{print $1}' |\
uniq -u | while read s junk; do
grep "$s" index | sed 's/^[a-f0-9]\{32\} [ *]//'
done

ksh storing result of a command to a variable

I want to store the result of a command in a variable in my shell script. I can't seem to get it to work. I want the most recently dated file in the directory.
PRODUCT= 'ls -t /some/dir/file* | head -1 | xargs -n1 basename'
It won't work.
You have two options: $( ) or backticks (`).
1) x=$(ls -t /some/dir/file* | head -1 | xargs -n1 basename)
or
2) x=`ls -t /some/dir/file* | head -1 | xargs -n1 basename`
echo $x
Edit: removing unnecessary bracket for (2).
The problem that you're having is that the command needs to be surrounded by back-ticks rather than single quotes. This is known as 'Command Substitution'.
Both Bash and ksh support $() for command substitution; the syntax originated in ksh and is specified by POSIX, so only very old Bourne-style shells lack it.
You should prefer $() where you can: it's easier to read (back-ticks are too easy to confuse with single quotes), and back-ticks are hard to nest.
This only addresses one of the problems with your command, however: ls returns directories as well as files, so if the most recent thing modified in the specified directory is a sub-directory, that is what you will see.
If you only want to see files, I suggest using some version of the following (I'm using Bash; the ${1:-.} default-value expansion is standard and should also work in ksh, while -maxdepth and -printf require GNU find):
lastfile ()
{
find "${1:-.}" -maxdepth 1 -type f -printf "%T+ %p\n" | sort | tail -1 | sed 's/[^[:space:]]\+ //'
}
This runs find on the directory, and only pulls files from that directory. It formats all of the files like this:
2012-08-29+16:21:40.0000000000 ./.sqlite_history
2013-01-14+08:52:14.0000000000 ./.davmail.properties
2012-04-04+16:16:40.0000000000 ./.DS_Store
2010-04-21+15:49:00.0000000000 ./.joe_state
2008-09-05+17:15:28.0000000000 ./.hplip.conf
2012-01-31+13:12:28.0000000000 ./.oneclick
sorts the list, takes the last line, and chops off everything before the first space.
You want $() (preferred) or backticks (``) (older style), rather than single quotes:
PRODUCT=$(ls -t /some/dir/file* | head -1 | xargs -n1 basename)
or
PRODUCT=`ls -t /some/dir/file* | head -1 | xargs -n1 basename`
Wrap the command substitution in double quotes, "$(...)", so that the name is kept intact even if it contains spaces, and also in case you later want more than one file; $(...) itself runs the command in a subshell and captures its output.
Adding the -1 option to ls makes the one-name-per-line output explicit (ls already switches to one name per line when writing to a pipe, so the option does no harm):
PRODUCT="$(ls -1t /some/dir/file* | head -1 | xargs -n1 basename)"
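Do not put spaces around the = in variable assignments (as I saw in other solutions here): with a space after the =, the shell sets the variable to an empty string and tries to run the next word as a command; with a space before it, the shell tries to run the variable name as a command.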
I would do something like:
Your version corrected:
PRODUCT=$(ls -t /some/dir/file* | head -1 | xargs -n1 basename)
Or simpler:
PRODUCT=$(cd /some/dir && ls -1t file* | head -1)
change to the directory
list one filename per line and sort by time/date
grab the first line
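If parsing ls output is a concern (filenames with newlines, or a directory being the newest entry), the newest regular file can also be picked with the shell's -nt test. This is a sketch under assumptions: newest_file is a hypothetical helper name, and -nt is not strictly POSIX, although ksh and Bash both support it.

```shell
# newest_file FILE...  (hypothetical helper)
# Print the basename of the most recently modified regular file among the
# arguments. The body runs in a subshell, so no variables leak out.
newest_file() (
    newest=
    for f in "$@"; do
        [ -f "$f" ] || continue    # ignore directories and unmatched globs
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    [ -n "$newest" ] && basename -- "$newest"
)

# Example:
#   PRODUCT=$(newest_file /some/dir/file*)
```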

Find lines common to several files

I'm trying to determine which header declares a specific function. I've used grep to find instances of the function's use; now, I want to find which header is included by all the files. I'm aware of the comm utility; however, it can only compare two sorted files. Is there a Unix utility that can find the common lines between an arbitrary number of unsorted files, or must I write my own?
cat *.c | sort | uniq -c | grep -e '^ *COUNT #include'
where COUNT is the number of files passed to cat. In playing around, I used this variant to see what files I #include at least 10 times:
cat *.c | sort | uniq -c | grep -e '^ *[0-9][0-9]\+ #include'
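One caveat with counting is that a line repeated inside a single file is counted more than once. A possible variant, sketched below with the hypothetical helper name common_lines, de-duplicates each file first and derives COUNT from the number of arguments.

```shell
#!/usr/bin/env bash
# common_lines FILE...  (hypothetical helper)
# Print the lines that occur in every one of the given files.
common_lines() {
    local n=$# f
    for f in "$@"; do
        LC_ALL=C sort -u -- "$f"      # count each line once per file
    done |
        LC_ALL=C sort | uniq -c |
        # keep lines whose count equals the number of files, then strip
        # the count that uniq -c put in front
        awk -v n="$n" '$1 == n { sub(/^ *[0-9]+ /, ""); print }'
}

# Example:
#   common_lines *.c | grep '#include'
```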

Resources