Compare and sync two (huge) directories - consider only filenames - linux

I want to do a one-way sync in Linux between two directories. One contains files and the other one contains processed files but has the same directory structure and the same filenames but some files might be missing.
Right now I am doing:
cd $SOURCE
find * -type f | while IFS= read -r fname; do
if [ ! -e "$TARGET/$fname" ]
then
# process the file and copy it to the target. Create directories if needed.
fi
done
which works, but is painfully slow.
Is there a better way to do this?
There are roughly 50,000,000 files, spread among directories and sub-directories. No single directory contains more than 255 files/sub-directories.
I looked at
rsync: it seems to always do a size or timestamp comparison. This will result in every file being flagged as different, since the processing takes some time and changes the file contents.
diff -qr: could not figure out how to make it ignore file sizes and content
Edit
Valid assumptions:
comparisons are being made solely on directory/file names
we don't care about file metadata and/or attributes (eg, size, owner, permissions, date/time last modified, etc)
we don't care about files that may reside in the target directory but without a matching file in the source directory. This is only partially true, but deletions from the source are rare and happen in bulk, so I will do a special case for that.

Assumptions:
comparisons are being made solely on directory/file names
we don't care about file metadata and/or attributes (eg, size, owner, permissions, date/time last modified, etc)
we don't care about files that may reside in the target directory but without a matching file in the source directory
I don't see a way around comparing 2x lists of ~50 million entries but we can try to eliminate the entry-by-entry approach of a bash looping solution ...
One idea:
# obtain sorted list of all $SOURCE files
srcfiles=$(mktemp)
cd "${SOURCE}"
find * -type f | sort > "${srcfiles}"
# obtain sorted list of all $TARGET files
tgtfiles=$(mktemp)
cd "${TARGET}"
find * -type f | sort > "${tgtfiles}"
# 'comm -23' => extract list of items that only exist in the first file - ${srcfiles}
missingfiles=$(mktemp)
comm -23 "${srcfiles}" "${tgtfiles}" > "${missingfiles}"
# process list of ${SOURCE}-only files
while IFS= read -r missingfile
do
process_and_copy "${missingfile}"
done < "${missingfiles}"
rm -f "${srcfiles}" "${tgtfiles}" "${missingfiles}"
This solution is (still) serial in nature so if there are a 'lot' of missing files the overall time to process said missing files could be appreciable.
With enough system resources (cpu, memory, disk throughput) a 'faster' solution would look at methods of parallelizing the work, eg:
running parallel find/sort/comm/process threads on different $SOURCE/$TARGET subdirectories (may work well if the number of missing files is evenly distributed across the different subdirectories) or ...
stick with the serial find/sort/comm but split ${missingfiles} into chunks and then spawn separate OS processes to process_and_copy the different chunks (a sketch of this follows below)
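A minimal sketch of the second option, assuming GNU split is available and process_and_copy is the same function as above; the chunk count of 8 is arbitrary and should be tuned to the machine:
# split the list of missing files into 8 roughly equal chunks without breaking lines (GNU split)
split -n l/8 "${missingfiles}" "${missingfiles}.chunk."
# spawn one background worker per chunk, then wait for all of them to finish
for chunk in "${missingfiles}".chunk.*
do
    while IFS= read -r missingfile
    do
        process_and_copy "${missingfile}"
    done < "${chunk}" &
done
wait
rm -f "${missingfiles}".chunk.*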

Related

How to find the timestamp of the latest modified file in a directory (recursively)?

I'm working on a process that needs to be restarted upon any change to any file in a specified directory, recursively.
I want to avoid using anything heavy, like inotify. I don't need to know which files were updated, but rather only whether or not files were updated at all. Moreover, I don't need to be notified of every change, but rather only to know if any changes have happened at a specific interval, dynamically determined by the process.
There has to be a way to do this with a fairly simple bash command. I don't mind having to execute the command multiple times; performance is not my primary concern for this use case. However, it would be preferable for the command to be as fast as possible.
The only output I need is the timestamp of the last change, so I can compare it to the timestamp that I have stored in memory.
I'm also open to better solutions.
I actually found a good answer from another closely related question.
I've only modified the command a little to adapt it to my needs:
find . -type f -printf '%T@\n' | sort -n | tail -1
%T@ returns the modification time as a unix timestamp, which is just what I need.
sort -n sorts the timestamps numerically.
tail -1 only keeps the last/highest timestamp.
It runs fairly quickly; ~400ms on my entire home directory, and ~30ms on the intended directory (measured using time [command]).
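For completeness, a sketch of how that could be wired into the periodic check (stored_mod stands in for whatever value the process keeps from the previous run; the name is just illustrative):
last_mod=$(find . -type f -printf '%T@\n' | sort -n | tail -1)
if [ "$last_mod" != "$stored_mod" ]
then
    # at least one file was modified since the previous check
    stored_mod=$last_mod
fi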
I just thought of an even better solution than the previous one, which also allows me to know about deleted files.
The idea is to use a checksum, but not a checksum of all the files; rather, a checksum of only their timestamps. If anything changes at all (new files, deleted files, modified files), then the checksum will change too!
find . -type f -printf '%T@,' | cksum
'%T@,' prints the modification time of each file as a unix timestamp, all on the same line.
cksum calculates a checksum of that list of timestamps.
It's actually even faster than the previous solution (by ~20%), because there is no need to sort (one of the slowest operations). The checksum itself is also cheap, since it runs over such a small amount of data (about 22 bytes per timestamp) rather than over the contents of every file.
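As a sketch, the whole check could then be reduced to comparing two checksums at each interval ($interval and the variable names here are illustrative placeholders):
prev_sum=""
while sleep "$interval"
do
    cur_sum=$(find . -type f -printf '%T@,' | cksum)
    if [ "$cur_sum" != "$prev_sum" ]
    then
        # files were added, deleted or modified since the last check
        prev_sum=$cur_sum
    fi
done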
Instead of remembering the timestamp of the last change, you could remember the last file that changed and find newer files using
find . -type f -newer "$lastfilethatchanged"
This does not work, however, if the same file changes again. Thus, you might need to create a temporary file with touch first:
touch --date="$previoustimestamp" "$tempfile"
find . -type f -newer "$tempfile"
where "$tempfile" could be, for example, in the memory at /dev/shm/.
You can also use a command like this to list file details, including timestamps, and then filter the output for the value you need:
$ find ./ -name "*.sqlite" -ls

Optimum directory structure for large number of files to display on a page

I currently have a single directory called "files" which contains 200,000 photos from about 100,000 members. When the number of members grows into the millions, I expect the number of files in the "files" directory to become very large. The file names are all random because the users named them, so the only way I can group them is by the user who created them. In essence, each user will have their own sub-directory.
The server runs Linux with an ext3 file system. I am wondering whether I should split the files into sub-directories inside the "files" directory. Is there any benefit to splitting them across many sub-directories? I have seen some arguments that it doesn't matter.
If I do need to split, I am thinking of creating directories based on the first two characters of the user ID, with a third-level sub-directory named after the user ID, like this:
files/0/0/00024userid/ (so all user ids started with 00 will go in files/0/0/...)
files/0/1/01auser/
files/0/2/0242myuserid/
.
files/0/a/0auser/
files/0/b/0bsomeuser/
files/0/c/0comeuser/
.
files/0/z/0zero/
files/1/0/10293832/
files/1/1/11029user/
.
files/9/z/9zl34/
files/a/0/a023user2/
..
files/z/z/zztopuser/
I will be showing 50 photos at a time. What is the most efficient (fastest) way for the server to pick up the files for static display: all from the same directory, or from 50 different sub-directories? Any comments or thoughts are appreciated. Thanks.
Depending on the file system, there may be an upper limit on how many files a directory can hold. This limit, and the performance impact of storing many files in one directory, are discussed at some length in another question.
Also keep in mind that your file names will likely not be truly random - quite a lot might start with "DSC", "IMG" and the like. In a similar vein, the different users (or, indeed, the same user) might try storing two images with the same name, necessitating a level of abstraction from the file name anyway.
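As an illustration of the layout sketched in the question, the two-level path can be derived from the user ID with bash substring expansion; the collision point above suggests storing uploads under a server-generated name rather than the user's original file name (all names here are hypothetical):
userid="00024userid"
photodir="files/${userid:0:1}/${userid:1:1}/${userid}"
mkdir -p "$photodir"
# $uploaded_tmpfile is a placeholder for the uploaded file's temporary path;
# a generated name ensures two "IMG_0001.jpg" uploads cannot collide
cp "$uploaded_tmpfile" "${photodir}/$(date +%s)_${RANDOM}.jpg"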

Strange results using Linux find

I am trying to set up a backup shell script that will run once per week on my server and keep the weekly backups for ten weeks, and it all works well, except for one thing...
I have a folder that contains many rather large files, so the ten weekly backups of that folder take up quite a lot of disk space, and many of the larger files in it rarely change. I therefore thought I would split the backup of that folder in two: one archive for the smaller files, which is included in the normal weekly backup (and kept for ten weeks), and one archive for the larger files, which is simply refreshed every week without keeping the older weekly versions.
I have used the following command for the larger files:
/usr/bin/find /other/projects -size +100M -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_LARGE.tar
That works as expected. The tar -v option is there for debugging. However, when archiving the smaller files, I use a similar command:
/usr/bin/find /other/projects -size -100M -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_$FILE_END.tar
Where $FILE_END is the week number. The line above does not work: when the script ran the other day it took hours and produced a file of 70 GB, although the expected output size is about 14 GB (there are a lot of files). It seems some files were duplicated in the resulting archive, though I have not been able to check this fully. Yesterday I ran the command above for the smaller files from the command line, and I could see that files I know to be larger than 100 MB were included.
However, just now I ran find /other/projects -size -100M from the command line and that produced the expected list of files.
So, if anyone has any ideas what I am doing wrong I would really appreciate tips or pointers. The file names include spaces and all sorts of characters, e.g. single quote, if that has something to do with it.
The only thing I can think of is that I am not using xargs properly and admittedly I am not very familiar with that, but I still think that the problem lies in my use of find since it is find that gives the input to xargs.
First of all, I do not know if it is considered bad form or not to answer your own question, but I am doing it anyway since I realised my error and I wanted to close this and hopefully be able to help someone having the same problem as I had.
Now, once I realised what I did wrong I frankly am a bit embarrassed that I did not see it earlier, but this is it:
I did some experimental runs from the command line and after a while I realised that the output not only listed all the files, it also listed the directories themselves. Directories are of course files too, and they are smaller than 100M, so they were (most likely) included; and when a directory is passed to tar, everything inside it is archived as well, regardless of size. This would also explain why the output file was five times larger than expected.
So, in order to overcome this I added -type f, which includes only regular files, to the find command and lo and behold, it worked!
To recap, the adjusted command I use for the smaller files is now:
/usr/bin/find /other/projects -size -100M -type f -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_$FILE_END.tar
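If GNU tar is available, a variant worth considering is to skip xargs entirely and let tar read the NUL-separated list itself, so the file list is never split across multiple tar invocations (a sketch, not tested against the original setup):
/usr/bin/find /other/projects -size -100M -type f -print0 | \
    /bin/tar --null -T - -rvPf /backup/PRJ-files_$FILE_END.tar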

shell script for deleting files, where files have sequence numbers

I need a shell script for deleting files, where files have sequence numbers.
For example, there is a directory like /abc/def inside of which I have files like:
xyz_1000_1_pqr.arc
xyz_1001_1_pqr.arc
xyz_1002_1_pqr.arc
xyz_1003_1_pqr.arc
xyz_1004_1_pqr.arc
xyz_1005_1_pqr.arc
xyz_1006_1_pqr.arc
xyz_1007_1_pqr.arc
xyz_1008_1_pqr.arc
Here I need to delete all the .arc files that have a sequence number less than (<) 1004. That is, only the files:
xyz_1000_1_pqr.arc
xyz_1001_1_pqr.arc
xyz_1002_1_pqr.arc
xyz_1003_1_pqr.arc
should be deleted.
(P.S.: each file is 4-5 GB and critical.)
EDIT:
Sorry for not mentioning this earlier.
Requirements: delete the files that have already been backed up, to keep the file system from reaching 100%. The backup team provides the sequence number of the latest file that has been backed up.
It would be very convenient to have a shell script that takes one argument (the sequence number of the last backed-up file) and deletes all files with a sequence number lower than that.
There are more than 30 servers with the same scenario, and the starting sequence number (the sequence number of the oldest file) is different on each one and is not known without logging in and checking the directory manually.
Hence a for loop from a known starting sequence number up to the ending sequence number, followed by rm, is out of the question.
What I am looking for is a generic script that can be deployed on all servers and that works only with the sequence number of the most recently backed-up file, so that it can be called as an event reaction from a tool (OEM 12c, Oracle-related, which generates file-system alerts).
As of now I log into each server manually and remove the files with regular expressions every time an alert is triggered because the file system crosses 70%, which is repetitive and hectic since I have other (DBA) concerns, so an automated script would save me a lot of time.
Thanks.
One way
rm xyz_100{0..3}_1_pqr.arc
If you have the start and end sequence numbers, it's just a matter of looping over them and deleting the files:
for (( i=$start_num ; i<=$end_num; i++ ))
do
rm xyz_${i}_*arc
done
This script starts with the number that you specify on the command line and deletes all files with that number or lower until it gets down to a number where the file doesn't exist.
#!/bin/bash
num=$1
for ((i=$num; i>=0; i--))
do
name=$(printf 'xyz_%03i_1_pqr.arc' $i)
[ -f "$name" ] || break
rm "$name"
done
The "%03i" format in the printf statement assures that the number, once formatted, will be three digits or longer. (That means that a number such as 99 is padded with zeros to become 099.) For printf, "%i" would mean format an integer, "%3i" would mean format an integer and give it at least three spaces, and "%03i" means format an integer into three spaces, left-padded with zeros as needed.
In an earlier version of this answer, the script checked every number down to zero looking for files to delete. In the comments you mention that sequence numbers may be up to 7 digits, which could make that exhaustive approach overly time-consuming. In this version the script counts down only until it reaches a sequence number whose file has already been deleted, and stops there.
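For the generic requirement in the edited question (delete everything with a sequence number strictly below the one the backup team supplies, without knowing the oldest number on each server), a sketch that globs the existing files instead of counting down could look like this; the directory and the xyz_<seq>_1_pqr.arc pattern are taken from the example and would need adjusting:
#!/bin/bash
# usage: cleanup.sh <sequence-number-of-last-backed-up-file>
cutoff=$1
cd /abc/def || exit 1
for f in xyz_*_1_pqr.arc
do
    [ -e "$f" ] || continue   # pattern matched nothing
    seq=${f#xyz_}             # strip the leading "xyz_"
    seq=${seq%%_*}            # keep only the sequence number
    if [ "$seq" -lt "$cutoff" ]
    then
        rm -- "$f"
    fi
done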

How can I compare two zip format(.tar,.gz,.Z) files in Unix

I have two gz files. I want to compare those files without extracting them. For example:
first file is number.txt.gz - inside that file:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
second file - xxx.txt.gz:
1111,589,3698,
2222,598,4589,
I want to be able to compare any column between those files. If column 1 in the first file is equal to column 1 of the second file, I want output like this:
1111,589,3698,
2222,598,4589,
You can't do this directly.
You can compare the full contents of archives by comparing the archives themselves, but not part of the data inside the compressed files.
You can also compare selected files in an archive without unpacking them, because the archive metadata carries a CRC32 checksum for each member; comparing those checksums tells you whether the files match without decompressing anything.
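For plain gzip files, for instance, the stored CRC and uncompressed size can be listed without decompressing anything, which gives a quick way to check whether two files hold identical data (a sketch, assuming GNU gzip):
gzip -lv number.txt.gz xxx.txt.gz
# compare the "crc" and "uncompressed" columns of the two output lines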
If you need to check and compare your data after it's written to those huge files, and you have time and space constraints preventing you from doing this, then you're using the wrong storage format. If your data storage format doesn't support your process then that's what you need to change.
My suggestion would be to throw your data into a database rather than writing it to compressed files. With sensible keys, comparison of subsets of that data can be accomplished with a simple query, and deleting no longer needed data becomes similarly simple.
Transactionality and strict SQL compliance are probably not priorities here, so I'd go with MySQL (with the MyISAM driver) as a simple, fast DB.
EDIT: Alternatively, Blorgbeard's suggestion is perfectly reasonable and feasible. In any programming language that has access to (de)compression libraries, you can read your way sequentially through the compressed file without writing the expanded text to disk; and if you do this side-by-side for two input files, you can implement your comparison with no space problem at all.
As for the time problem, you will find that reading and uncompressing the file (but not writing it to disk) is much faster than writing to disk. I recently wrote a similar program that takes a .ZIPped file as input and creates a .ZIPped file as output without ever writing uncompressed data to file; and it runs much more quickly than an earlier version that unpacked, processed and re-packed the data.
You cannot compare the files while they remain compressed using different techniques.
You must first decompress the files, and then find the difference between the results.
Decompression can be done with gunzip, tar, and uncompress (or zcat).
Finding the difference can be done with the diff command.
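With bash process substitution, the decompress-and-diff step can be a one-liner that never writes the expanded text to disk (a sketch):
diff <(zcat number.txt.gz) <(zcat xxx.txt.gz)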
I'm not 100% sure whether this is meant to match individual columns/fields or entire rows, but in the case of rows, something along these lines should work:
comm -12 <(zcat number.txt.gz) <(zcat xxx.txt.gz)
or if the shell doesn't support that, perhaps:
zcat number.txt.gz | { zcat xxx.txt.gz | comm -12 /dev/fd/3 - ; } 3<&0
The exact answer I wanted is this:
nawk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(gzcat file1.txt.gz) <(gzcat file2.txt.gz)
Instead of awk, nawk works perfectly here, and since these are gzip files, gzcat is used to decompress them on the fly.
