Find and tar for each file in Linux - linux

I have a list of files with different modification times, 1_raw,2_raw,3_raw... I want to find files that are modified more than 10 days ago and zip them to release disk space. However, the command:
find . -mtime +10 |xargs tar -cvzf backup.tar.gz
will create a new file backup.tar.gz
What I want is to create a tarball for each file, so that I can easily unzip each of them when needed. After the command, my files should become: 1_raw.tar.gz, 2_raw.tar.gz, 3_raw.tar.gz...
Is there any way to do this? Thanks!

Something like this is what you are after:
find . -mtime +10 -type f -print0 | while IFS= read -r -d '' file; do
tar -cvzf "${file}.tar.gz" "$file"
done
The -type f was added so that it doesn't also process directories, just files.
This adds a compressed archive of each file that was modified more than 10 days ago, in all subdirectories, and places the compressed archive next to its respective unarchived version (in the same folder). I assume this is what you wanted.
If you don't need to handle whitespace in the paths, you could simply do:
for f in $(find . -mtime +10 -type f) ; do
tar -cvzf "${f}.tar.gz" "$f"
done
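For anyone who wants to try the per-file approach before touching real data, here is a minimal sketch that runs in a throwaway directory. The scratch directory, the file names, and the use of GNU touch -d to fake an old timestamp are assumptions made for the demo, not part of the question. It uses find -exec sh -c rather than the while read loop, which also copes with spaces in names:

```shell
#!/bin/sh
# Demo only: work in a scratch directory with hypothetical file names.
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'old data\n' > '1 raw'      # name containing a space
printf 'new data\n' > '2_raw'
touch -d '20 days ago' '1 raw'     # GNU touch: pretend the file is 20 days old

# One archive per old regular file, written next to the original.
# Archives left over from earlier runs are skipped by name.
find . -type f -mtime +10 ! -name '*.tar.gz' \
    -exec sh -c 'tar -czf "$1.tar.gz" "$1"' sh {} \;

# Show what ended up inside the single-file archive.
listing=$(tar -tzf '1 raw.tar.gz')
echo "$listing"
```

Note that tar leaves the original file in place, so to actually free disk space you would still have to remove the originals once you trust the archives.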

Simply, try this
$ find . -mtime +10 | xargs -I {} tar czvf {}.tar.gz {}
Here, {} indicates replace-str
-I replace-str
Replace occurrences of replace-str in the initial-arguments with names read from standard input. Also, unquoted blanks do not terminate input items; instead the separator is the newline character. Implies -x and -L 1.
https://linux.die.net/man/1/xargs

Related

Write a script that deletes all the regular files (not the directories) with a .js extension that are present in the current directory and all its sub [duplicate]

I'm trying to work out a command which deletes sql files older than 15 days.
The find part is working but not the rm.
rm -f | find -L /usr/www2/bar/htdocs/foo/rsync/httpdocs/db_backups -type f \( -name '*.sql' \) -mtime +15
It kicks out a list of exactly the files I want deleted but is not deleting them. The paths are correct.
usage: rm [-f | -i] [-dIPRrvW] file ...
unlink file
/usr/www2/bar/htdocs/foo/rsync/httpdocs/db_backups/20120601.backup.sql
...
/usr/www2/bar/htdocs/foo/rsync/httpdocs/db_backups/20120610.backup.sql
What am I doing wrong?
You are actually piping rm's output to the input of find. What you want is to use the output of find as arguments to rm:
find -type f -name '*.sql' -mtime +15 | xargs rm
xargs is the command that "converts" its standard input into arguments of another program, or, as they more accurately put it on the man page,
build and execute command lines from standard input
Note that if file names can contain whitespace characters, you should correct for that:
find -type f -name '*.sql' -mtime +15 -print0 | xargs -0 rm
But actually, find has a shortcut for this: the -delete option:
find -type f -name '*.sql' -mtime +15 -delete
Please be aware of the following warnings in man find:
Warnings: Don't forget that the find command line is evaluated
as an expression, so putting -delete first will make find try to
delete everything below the starting points you specified. When
testing a find command line that you later intend to use with
-delete, you should explicitly specify -depth in order to avoid
later surprises. Because -delete implies -depth, you cannot
usefully use -prune and -delete together.
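As a rough illustration of that warning, the sketch below previews the match with -print and only then swaps in -delete at the end of the same expression. The scratch directory and file names are made up for the demo, and touch -d is assumed to be GNU touch:

```shell
#!/bin/sh
# Demo only: hypothetical backup directory inside a scratch location.
demo=$(mktemp -d)
mkdir "$demo/db_backups"
touch "$demo/db_backups/new.backup.sql"
touch -d '20 days ago' "$demo/db_backups/old.backup.sql" \
                       "$demo/db_backups/old.notes.txt"

# Step 1: preview. -depth is given explicitly because -delete implies it.
find "$demo/db_backups" -depth -type f -name '*.sql' -mtime +15 -print

# Step 2: same tests, with -delete last so it only acts on what matched.
find "$demo/db_backups" -depth -type f -name '*.sql' -mtime +15 -delete

# Only the recent .sql file and the non-.sql file should be left.
remaining=$(ls "$demo/db_backups")
echo "$remaining"
```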
P.S. Note that piping directly to rm isn't an option, because rm doesn't expect filenames on standard input. What you are currently doing is piping them backwards.
find /usr/www/bar/htdocs -mtime +15 -exec rm {} \;
Will select files in /usr/www/bar/htdocs older than 15 days and remove them.
Another simpler method is to use locate command. Then, pipe the result to xargs.
For example,
locate file | xargs rm
Use xargs to pass arguments, with the option -rd '\n' to ignore spaces in names:
"${command}" | xargs -rd '\n' rm
Include --force if you want to also remove read only files.
Assuming you aren't in the directory containing the *.sql backup files:
find /usr/www2/bar/htdocs/foo/rsync/httpdocs/db_backups/*.sql -mtime +15 -exec rm -v {} \;
The -v option above is handy: it verbosely lists the files as they are removed.
I like to list the files that will be deleted first to be sure. E.g:
find /usr/www2/bar/htdocs/foo/rsync/httpdocs/db_backups/*.sql -mtime +15 -exec ls -lrth {} \;

Create ZIP of hundred thousand files based on date newer than one year on Linux

I have a /folder with over a half million files created in the last 10 years. I'm restructuring the process so that in the future there are subfolders based on the year.
For now, I need to backup all files modified within the last year. I tried
zip -r /backup.zip $(find /folder -type f -mtime -365)
but get error: Argument list too long.
Is there any alternative to get the files compressed and archived?
Zip has an option to read the filelist from stdin. Below is from the zip man page
-@ file lists. If a file list is specified as -@ [Not on MacOS],
zip takes the list of input files from standard input instead of
from the command line. For example,
zip -@ foo
will store the files listed one per line on stdin in foo.zip.
This should do what you need
find /folder -type f -mtime -365 | zip -@ /backup.zip
Note that I've removed the -r option because it isn't doing anything - you are explicitly selecting standard files with the find command (-type f)
You'll have to switch from passing all the files at once to piping the files one at a time to the zip command.
find /folder -type f -mtime -365 | while IFS= read -r FILE; do zip -r /backup.zip "$FILE"; done
You can also work with the -exec parameter in find, like this:
find /folder -type f -mtime -365 -exec zip -r /backup.zip {} \;
(or whatever your command is). For every file, the given command is executed with the file passed as a last parameter.
Find the files and then execute the zip command on as many files as possible using + as opposed to ;
find /folder -type f -mtime -365 -exec zip -r /backup.zip '{}' +

Statement that compress files older than X and after it removes old ones

Trying to do a bash script, that will compress files older than X, and after compressing removes uncompressed version. Tried something like this, but it doesn't work.
find /home/randomcat -mtime +11 -exec gzip {}\ | -exec rm;
By default, gzip will remove the uncompressed file (since it replaces it with the compressed variant). And you don't want it to run on anything else than a plain file (not on directories or devices, not on symbolic links).
So you want at least
find /home/randomcat -mtime +11 -type f -exec gzip {} \;
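To see that behaviour without risking real files, here is a small sketch run in a scratch directory. The directory layout and file names are invented for the demo, and touch -d is assumed to be GNU touch:

```shell
#!/bin/sh
# Demo only: hypothetical files in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/subdir"
printf 'hello\n' > "$demo/subdir/a.log"
printf 'fresh\n' > "$demo/b.log"
# Make one file and its directory look 20 days old.
touch -d '20 days ago' "$demo/subdir/a.log" "$demo/subdir"

# -type f keeps gzip away from the (also old) directory.
find "$demo" -mtime +11 -type f -exec gzip {} \;

# a.log is gone and a.log.gz has taken its place; b.log is untouched.
ls -R "$demo"
```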
You could even want find(1) to avoid files with several hard links. And you might also want it to ask you before running the command. Then you could try:
find /home/randomcat -mtime +11 -type f -links 1 -ok gzip {} \;
The find command with -exec or -ok wants a semicolon (or a + sign), and you need to escape that semicolon ; from your shell. You could use ';' instead of \; to quote it...
If you use a + the find command will group several arguments (to a single gzip process), so it will run fewer processes (but each would last longer). So you could try
find /home/randomcat -mtime +11 -type f -links 1 -exec gzip -v {} +
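The difference between \; and + is easy to observe with a harmless stand-in command. In this sketch, sh -c 'echo "$#"' takes the place of gzip and simply prints how many file names it was handed; the scratch directory and file names are assumptions for the demo:

```shell
#!/bin/sh
# Demo only: three empty files in a scratch directory.
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c"

# With \; the command runs once per file, so every run sees 1 argument.
per_file=$(find "$demo" -type f -exec sh -c 'echo "$#"' sh {} \;)

# With + find batches the names, so a single run sees all 3 arguments.
batched=$(find "$demo" -type f -exec sh -c 'echo "$#"' sh {} +)

printf 'per file:\n%s\nbatched:\n%s\n' "$per_file" "$batched"
```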
You may want to read more about globbing and how a shell works.
BTW, you don't need any command pipeline (as suggested by the wrong use of | in your question).
You could even consider using GNU parallel to run things in parallel, or feed some shell (with background jobs) with e.g.
find /home/randomcat -mtime +11 -type f -links 1 \
-exec printf "gzip %s &\n" {} \; | bash -x
but in practice this won't speed up your processing much.
find /home/randomcat -mtime +11 -exec gzip {} +
This command compresses the files that find selects. gzip does not create new files alongside the originals; it replaces each file with its compressed version. Say you have three files older than X, named a, b and c.
After running find /home/randomcat -mtime +11 -exec gzip {} + command,
you will see a.gz b.gz c.gz instead of seeing a b c in /home/randomcat directory.
find /location/location -type f -ctime +15 -exec mv {} /location/backup_location \;
This will help you find all the files and move to backup folder

cat files in subdirectories using linux commands

I have the following directories:
P922_101
P922_102
.
.
Each directory, for instance P922_101 has following subdirectories:
140311_AH8MHGADXX 140401_AH8CU4ADXX
Each subdirectory, for instance 140311_AH8MHGADXX has the following files:
1_140311_AH8MH_P922_101_1.fastq.gz 1_140311_AH8MH_P922_101_2.fastq.gz
2_140311_AH8MH_P922_101_1.fastq.gz 2_140311_AH8MH_P922_101_2.fastq.gz
And files in 140401_AH8CU4ADXX are:
1_140401_AH8CU_P922_101_1.fastq.gz 1_140401_AH8CU_P922_4001_2.fastq.gz
2_140401_AH8CU_P922_101_1.fastq.gz 2_140401_AH8CU_P922_4001_2.fastq.gz
I want to do 'cat' for the files in the subdirectories in the following way:
cat 1_140311_AH8MH_P922_101_1.fastq.gz 2_140311_AH8MH_P922_101_1.fastq.gz
1_140401_AH8CU_P922_101_1.fastq.gz 2_140401_AH8CU_P922_101_1.fastq.gz > P922_101_1.fastq.gz
which means that files ending with _1.fastq.gz should be concatenated into a single file and files ending with _2.fastq.gz into another file.
It should be run for all files in subdirectories in all directories. Could someone give a linux solution to do this?
Since they're compressed, you should probably use gzip -dc (decompress and write to stdout) -
find /somePath -type f -name "*.fastq.gz" -exec gzip -dc {} \; | \
tee -a /someOutFolder/out.txt
You can use find for this:
find /top/path -mindepth 2 -type f -name "*_1.fastq.gz" -exec cat {} \; > one_file
find /top/path -mindepth 2 -type f -name "*_2.fastq.gz" -exec cat {} \; > another_file
This will look for all the files starting from /top/path and having a name matching the pattern _1.fastq.gz / _2.fastq.gz and cat them into the desired file. -mindepth 2 makes find match only files that are at least two levels below /top/path, i.e. inside a subdirectory; this way, files directly in /top/path won't be matched.
Note that you will probably need zcat instead of cat, for gz files.
As you keep adding details in comments, let's see what else we can do:
Say you have the list of directories in a file directories_list, each line containing one:
while IFS= read -r directory
do
find "$directory" -mindepth 2 -type f -name "*_1.fastq.gz" -exec cat {} \; > "$directory/output"
done < directories_list
done < directories_list
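If the goal is one merged .gz per sample directory, as in the question, a plain glob loop may be enough. The sketch below is only an illustration: it builds a tiny stand-in layout in a scratch directory (the run_A/run_B names and the file contents are made up) and relies on the fact that concatenated gzip members still form a valid gzip file, so cat can be used directly. Some downstream tools may handle multi-member gzip files less gracefully, so check that before relying on it:

```shell
#!/bin/sh
# Demo only: shortened, hypothetical names in a scratch directory.
top=$(mktemp -d)
mkdir -p "$top/P922_101/run_A" "$top/P922_101/run_B"
printf 'A1\n' | gzip > "$top/P922_101/run_A/1_A_P922_101_1.fastq.gz"
printf 'A2\n' | gzip > "$top/P922_101/run_A/1_A_P922_101_2.fastq.gz"
printf 'B1\n' | gzip > "$top/P922_101/run_B/1_B_P922_101_1.fastq.gz"
printf 'B2\n' | gzip > "$top/P922_101/run_B/1_B_P922_101_2.fastq.gz"

# The trailing slash in the glob restricts the match to directories.
for dir in "$top"/P922_*/; do
    sample=$(basename "$dir")
    for read_no in 1 2; do
        # Merge every *_1 (then *_2) file from all run subdirectories.
        cat "$dir"*/*_"$read_no".fastq.gz > "$top/${sample}_${read_no}.fastq.gz"
    done
done

# Decompressing the merged file yields the contents of both inputs.
merged=$(gzip -dc "$top/P922_101_1.fastq.gz")
echo "$merged"
```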

Find files and tar them (with spaces)

Alright, so simple problem here. I'm working on a simple back up code. It works fine except if the files have spaces in them. This is how I'm finding files and adding them to a tar archive:
find . -type f | xargs tar -czvf backup.tar.gz
The problem is when the file has a space in the name because tar thinks that it's a folder. Basically is there a way I can add quotes around the results from find? Or a different way to fix this?
Use this:
find . -type f -print0 | tar -czvf backup.tar.gz --null -T -
It will:
deal with files with spaces, newlines, leading dashes, and other funniness
handle an unlimited number of files
won't repeatedly overwrite your backup.tar.gz like using tar -c with xargs will do when you have a large number of files
Also see:
GNU tar manual
How can I build a tar from stdin?, search for null
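A quick way to convince yourself that this handles awkward names is to try it on a scratch directory first. In this sketch the directory and the three file names are invented for the demo, GNU tar is assumed, and the archive is written outside the directory being scanned so that find does not pick it up:

```shell
#!/bin/sh
# Demo only: hypothetical files with a space and a leading dash.
demo=$(mktemp -d)
mkdir "$demo/src"
touch "$demo/src/plain.txt" "$demo/src/with space.txt" \
      "$demo/src/-leading-dash.txt"

# NUL-separated names go straight from find into one tar invocation.
( cd "$demo/src" && find . -type f -print0 |
    tar -czf "$demo/backup.tar.gz" --null -T - )

# Count the entries that made it into the archive.
count=$(tar -tzf "$demo/backup.tar.gz" | wc -l)
echo "archived entries: $count"
```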
There could be another way to achieve what you want. Basically,
Use the find command to output path to whatever files you're looking for. Redirect stdout to a filename of your choosing.
Then tar with the -T option which allows it to take a list of file locations (the one you just created with find!)
find . -name "*.whatever" > yourListOfFiles
tar -cvf yourfile.tar -T yourListOfFiles
Try running:
find . -type f | xargs -d "\n" tar -czvf backup.tar.gz
Why not:
tar czvf backup.tar.gz *
Sure it's clever to use find and then xargs, but you're doing it the hard way.
Update: Porges has commented with a find-option that I think is a better answer than my answer, or the other one: find -print0 ... | xargs -0 ....
If you have multiple files or directories and you want to zip them into independent *.gz file you can do this. Optional -type f -atime
find -name "httpd-log*.txt" -type f -mtime +1 -exec tar -vzcf {}.gz {} \;
This will compress
httpd-log01.txt
httpd-log02.txt
to
httpd-log01.txt.gz
httpd-log02.txt.gz
Would add a comment to @Steve Kehlet's post but need 50 rep (RIP).
For anyone that has found this post through numerous googling, I found a way to not only find specific files given a time range, but also NOT include the relative paths OR whitespaces that would cause tarring errors. (THANK YOU SO MUCH STEVE.)
find . -name "*.pdf" -type f -mtime 0 -printf "%f\0" | tar -czvf /dir/zip.tar.gz --null -T -
. relative directory
-name "*.pdf" look for pdfs (or any file type)
-type f type to look for is a file
-mtime 0 look for files modified in the last 24 hours
-printf "%f\0" Regular -print0 OR -printf "%f" did NOT work for me. From man pages:
This quoting is performed in the same way as for GNU ls. This is not the same quoting mechanism as the one used for -ls and -fls. If you are able to decide what format to use for the output of find then it is normally better to use '\0' as a terminator than to use newline, as file names can contain white space and newline characters.
-czvf create archive, filter the archive through gzip , verbosely list files processed, archive name
Edit 2019-08-14:
I would like to add that I was also able to use essentially the same command as in my comment, just using tar itself:
tar -czvf /archiveDir/test.tar.gz --newer-mtime=0 --ignore-failed-read *.pdf
Needed --ignore-failed-read in-case there were no new PDFs for today.
Why not give something like this a try: tar cvf scala.tar `find src -name '*.scala'`
Another solution as seen here:
find var/log/ -iname "anaconda.*" -exec tar -cvzf file.tar.gz {} +
The best solution seems to be to create a file list and then archive the files, because you can use other sources and do something else with the list.
For example this allows using the list to calculate size of the files being archived:
#!/bin/sh
backupFileName="backup-big-$(date +"%Y%m%d-%H%M")"
backupRoot="/var/www"
backupOutPath=""
archivePath=$backupOutPath$backupFileName.tar.gz
listOfFilesPath=$backupOutPath$backupFileName.filelist
#
# Make a list of files/directories to archive
#
echo "" > $listOfFilesPath
echo "${backupRoot}/uploads" >> $listOfFilesPath
echo "${backupRoot}/extra/user/data" >> $listOfFilesPath
find "${backupRoot}/drupal_root/sites/" -name "files" -type d >> $listOfFilesPath
#
# Size calculation
#
sizeForProgress=`
cat $listOfFilesPath | while read nextFile;do
if [ ! -z "$nextFile" ]; then
du -sb "$nextFile"
fi
done | awk '{size+=$1} END {print size}'
`
#
# Archive with progress
#
## simple with dump of all files currently archived
#tar -czvf $archivePath -T $listOfFilesPath
## progress bar
sizeForShow=$(($sizeForProgress/1024/1024))
echo -e "\nRunning backup [source files are $sizeForShow MiB]\n"
tar -cPp -T $listOfFilesPath | pv -s $sizeForProgress | gzip > $archivePath
Big warning on several of the solutions (and your own test) :
When you do : anything | xargs something
xargs will try to fit "as many arguments as possible" after "something", but then you may end up with multiple invocations of "something".
So your attempt: find ... | xargs tar czvf file.tgz
may end up overwriting "file.tgz" at each invocation of "tar" by xargs, and you end up with only the last invocation! (the chosen solution uses a GNU -T special parameter to avoid the problem, but not everyone has that GNU tar available)
You could do instead:
find . -type f -print0 | xargs -0 tar -rvf backup.tar
gzip backup.tar
Proof of the problem on cygwin:
$ mkdir test
$ cd test
$ seq 1 10000 | sed -e "s/^/long_filename_/" | xargs touch
# create the files
$ seq 1 10000 | sed -e "s/^/long_filename_/" | xargs tar czvf archive.tgz
# will invoke tar several times as it can't fit 10000 long filenames into 1
$ tar tzvf archive.tgz | wc -l
60
# in my own machine, I end up with only the 60 last filenames,
# as the last invocation of tar by xargs overwrote the previous one(s)
# proper way to invoke tar: with -r (which append to an existing tar file, whereas c would overwrite it)
# caveat: you can't have it compressed (you can't add to a compressed archive)
$ seq 1 10000 | sed -e "s/^/long_filename_/" | xargs tar rvf archive.tar #-r, and without z
$ gzip archive.tar
$ tar tzvf archive.tar.gz | wc -l
10000
# we have all our files, despite xargs making several invocations of the tar command
Note: that behavior of xargs is a well-known difficulty, and it is also why, when someone wants to do:
find .... | xargs grep "regex"
they instead have to write it:
find ..... | xargs grep "regex" /dev/null
That way, even if the last invocation of grep by xargs appends only 1 filename, grep sees at least 2 filenames (as each time it has: /dev/null, where it won't find anything, and the filename(s) appended by xargs after it) and thus will always display the file names when something matches "regex". Otherwise you may end up with the last results showing matches without a filename in front.
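The effect is easy to reproduce with a single file, which is the case where grep drops the file name. This is a small sketch in a scratch directory; the file name and its contents are made up for the demo:

```shell
#!/bin/sh
# Demo only: one hypothetical file containing the search word.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'hello\n' > only.txt

# xargs hands grep a single file name, so grep omits the name.
without=$(printf 'only.txt\n' | xargs grep hello)

# With /dev/null as an extra file, grep always prefixes the name.
with=$(printf 'only.txt\n' | xargs grep hello /dev/null)

printf '%s\n%s\n' "$without" "$with"
```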
