Tar backup strange behaviour in bash script - linux

I have a bash script that backs up directories under / and I am using pipe viewer (pv) to show a progress bar. When I run the command manually it works fine, but when I run the same command in the bash script the progress bar exceeds 100% and ends up at 108%. The ETA shows 00:00 when it reaches 100%, but the transfer keeps going. Here is my script:
DIR_1="/var /root /sbin /bin /etc /lib /www /usr /mnt";
CAL_1=$(du -skc $DIR_1 | awk '{print $1}' | tail -n1);
if [ "$1" = "fullbackup" ]
then
echo "`date +%F\ %H:%M` backup started..">>$FLOG;
/bin/nice -n 19 tar cf - `echo $DIR_1` | pv -s ${CAL_1}k >$DESTINATION/$FILENAME -f 2>$TARLOG;
if [ $? -eq 0 ]
then
echo "`date +%F\ %H:%M` tar archive successfull" >>$FLOG ;
NEW=$(cd $DESTINATION && ls -t full* | head | sed '1!d');
elif [ $? -ne 0 ]
then
echo "`date +%F\ %H:%M` tar archive failed">>$FLOG;
rm $DESTINATION/$FILENAME;
exit;
fi

Here's what's going on. You are telling pv that the size of the tar stream equals the total size of the files under $DIR_1, and that is not always true.
In particular, when you run the program and get 108% rather than 100%, the 8% difference is the overhead of the tar format. You can easily verify this by comparing the value of $CAL_1 with the size reported by ls -l $DESTINATION/$FILENAME.
There can also be slight discrepancies between what du -k means by K and how pv interprets it. To eliminate that possibility, work in bytes on both sides: drop the 'k' suffix from pv -s and have du report bytes (du -sbc with GNU du).
I think running this interactively is a red herring. My guess is that when you ran it interactively you used a small filesystem where the tar overhead was close to 0, whereas the script ran on a large filesystem. (Small filesystems are what one would use interactively, because who wants to wait hours for results?)
To rule out the interactive-versus-script difference, make sure you run both on the same filesystem.
If this doesn't do it, I'd suggest adding the specific output mentioned above to your question, such as the value of $CAL_1 and an ls -l of the final tar file.
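One way to check the overhead explanation is to measure it directly. The sketch below compares the number of bytes in the files themselves with the number of bytes tar actually writes to the pipe, which is what pv counts. The function names payload_bytes and tar_stream_bytes are made up for this illustration and are not part of the original script.

```shell
#!/bin/bash
# Sum of the contents of all regular files under the given paths, in bytes.
payload_bytes() {
    find "$@" -type f -exec cat {} + | wc -c
}

# Number of bytes tar writes to stdout for the same paths
# (file data plus a 512-byte header per entry and padding).
tar_stream_bytes() {
    tar cf - "$@" 2>/dev/null | wc -c
}
```

Running both against the same directories shows how far apart the two numbers are. Passing the second number (without a suffix) to pv -s should give an exact total, at the cost of reading everything twice.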

Related

Remote to Local rolling backup script

I'm trying to create a bash script that runs through crontab to execute a backup remote to local. Everything works but my rolling backup part, where it only keeps 4 backups.
#!/bin/bash
dateForm=`date +%m-%d-%Y`
fileName=[redacted]-"$dateForm"
echo backup started for [redacted] on: $dateForm >> /home/backups/backLog.log
ls -tQ /home/backups/[redacted] | tail -n+5 | xargs -r rm
ssh root@[redacted] "tar jcf - -C /home/[redacted]/[redacted] ." > "/home/backups/[redacted]/$fileName".tar.bz2
if [ ! -f "/home/backups/[redacted]/$fileName.tar.bz2" ]
then
echo "something went wrong with the backup for $fileName!" >> /home/backups/backLog.log
else
echo "Backup completed for $fileName" >> /home/backups/backLog.log
fi
The ls line works fine if executed inside the backup directory, but crontab runs the script from elsewhere and I need the script to live outside the folder it targets. Because ls prints bare file names, I can't get the piped rm to act on the correct directory.
I was able to come up with a solution after studying the man page for ls a little more and using find to supply the full paths:
ls -tQ $(find /home/backups/[redacted] -type f -name "*") | tail -n+5 | xargs -r rm
I'm posting this answer for anyone who doesn't want a rolling backup script that depends entirely on date formatting; this way there will ALWAYS be at least 4 backups in the target folder.

scp: how to find out that copying was finished

I'm using the scp command to copy a file from one Linux host to another.
I run the scp command on host1 and copy the file from host1 to host2. The file is quite big and takes some time to copy.
On host2 the file appears as soon as copying starts. I can do anything with this file even while copying is still in progress.
Is there any reliable way to find out on host2 whether copying has finished?
Off the top of my head, you could do something like:
touch tinyfile
scp bigfile tinyfile user@host:
Then when tinyfile appears you know that the transfer of bigfile is complete.
As pointed out in the comments, this assumes that scp will copy the files one by one, in the order specified. If you don't trust it, you could do them one by one explicitly:
scp bigfile user@host:
scp tinyfile user@host:
The disadvantage of this approach is that you would potentially have to authenticate twice. If this were an issue you could use something like ssh-agent.
On the sending side (host1), use a script like this:
#!/bin/bash
echo 'starting transfer'
scp FILE USER@DST_SERVER:DST_PATH
OUT=$?
if [ $OUT = 0 ]; then
    echo 'transfer successful'
    # create an empty marker file and send it only after FILE has arrived
    touch successful
    scp successful USER@DST_SERVER:DST_PATH
else
    echo 'transfer failed'
fi
On the receiving side (host2), use a script like this:
#!/bin/bash
SLEEP_TIME=30
MAX_CNT=10
CNT=0
# -lt compares numerically; < inside [[ ]] would compare the values as strings
while [[ ! -e successful && $CNT -lt $MAX_CNT ]]; do
    ((CNT++))
    sleep "$SLEEP_TIME"
done
if [[ -e successful ]]; then
    echo 'successful'
    rm successful
    # do something with FILE
fi
CNT and MAX_CNT prevent an endless loop (in case the file successful is never transferred).
The product of MAX_CNT and SLEEP_TIME should be equal to or greater than the expected transfer time. In my example the expected transfer time is less than 300 seconds.
A checksum (md5sum, sha256sum ,sha512sum) of the local and remote files would tell you if they're identical.
For the situation where you don't have SSH access to the remote system - like an FTP server - you can download the file after it's uploaded and compare the checksums. I do this for files I send from production scripts at work. Below is a snippet from the script in which I do this.
MD5SRC=$(md5sum $LOCALFILE | cut -c 1-32)
MD5TESTFILE=$(mktemp -p /ramdisk)
curl \
-o $MD5TESTFILE \
-sS \
-u $FTPUSER:$FTPPASS \
ftp://$FTPHOST/$REMOTEFILE
MD5DST=$(md5sum $MD5TESTFILE | cut -c 1-32)
if [ "$MD5SRC" == "$MD5DST" ]
then
echo "+Local and Remote files match!"
else
echo "-Local and Remote files don't match"
fi
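For the case where you do have SSH access, the same checksum idea can be sketched with two small helpers. The names file_sha256 and same_checksum are made up for this illustration, and the example compares two local files so it can be tried without a remote host.

```shell
#!/bin/bash
# Print only the hash field of sha256sum's output for one file.
file_sha256() {
    sha256sum -- "$1" | cut -d ' ' -f 1
}

# Succeed (exit status 0) when both files have the same SHA-256 hash.
same_checksum() {
    [ "$(file_sha256 "$1")" = "$(file_sha256 "$2")" ]
}
```

To compare against a remote copy, the second hash could come from something like ssh user@host "sha256sum /path/to/bigfile" | cut -d ' ' -f 1 (the host and path here are placeholders). A match only tells you the copy is complete if the first hash was taken from the finished source file.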
If you use inotify-tools, the solution looks like this:
while ! inotifywait -e close $(dirname ${bigfile_fullname}) 2>/dev/null | \
grep -Eo "CLOSE $(basename ${bigfile_fullname})$">/dev/null
do true
done
echo "File ${bigfile_fullname} closed"
After some investigation, and discussion of the problem on other forums I have found one more solution. Maybe it can help somebody.
There is a command "lsof". It lists open files. During copying the file will be opened, so the command
lsof | grep filename
will return non empty result.
So you might want to make a while loop to wait until lsof returns nothing and proceed with your task.
Example:
# provide your file name here
f=<nameOfYourFile>
lsofresult=`lsof | grep $f | wc -l`
while [ $lsofresult != 0 ]; do
echo still copying file $f...
sleep 5
lsofresult=`lsof | grep $f | wc -l`
done; echo copying file $f is finished: `ls $f`
For the duplicate question, "How to check if file has been scp 100% to the remote location", which was about an expect script: to know whether a file has been transferred completely, we can add expect 100%, something like this:
expect -c "
set timeout 1
spawn scp user@$REMOTE_IP:/tmp/my.file user@$HOST_IP:/home/.
expect yes/no { send yes\r ; exp_continue }
expect password: { send $SCP_PASSWORD\r }
expect 100%
sleep 1
exit
"
if [ -f "/home/my.file" ]; then
echo "Success"
fi
If avoiding a second SSH handshake is important, you can use something like the following:
ssh host cat \> bigfile \&\& touch complete < bigfile
Then wait for the "complete" file to get created on the remote end.

Shell script: Count files, delete 'X' oldest file

I am new to scripting. Currently I have a script that backs up a directory every day to a file server and deletes any file older than 14 days. My issue is that I need it to count the actual files and delete the 14th oldest one. When going by days, if the file server or host is down for a few days or longer, then once it is back up the script will delete a couple of days' worth of backups, or even all of them, depending on the downtime. I want it to always keep 14 days' worth of backups.
I tried searching around and could only find solutions that delete by date, like what I have now.
Thank you for the help/advice!
Here is my code; sorry, it's my first attempt at scripting:
#! /bin/sh
#Check for file. If not found, the connection to the file server is down!
if
[ -f /backup/connection ];
then
echo "File Server is connected!"
#Directory to be backed up.
backup_source="/var/www/html/moin-1.9.7"
#Backup directory.
backup_destination="/backup"
#Current date to name files.
date=`date '+%m%d%y'`
#naming the file.
filename="$date.tgz"
echo "Backing up directory"
#Creating the back up of the backup_source directory and placing it into the backup_destination directory.
tar -cvpzf $backup_destination/$filename $backup_source
echo "Backup Finished!"
#Search for folders older than '+X' days and delete them.
find /backup -type f -ctime +13 -exec rm -rf {} \;
else
echo "File Server is NOT connected! Date:`date '+%m-%d-%y'` Time:`date '+%H:%M:%S'`" > /user/Desktop/error/`date '+%m-%d-%y'`
fi
Something along these lines might work:
ls -1t /path/to/directory/ | head -n 14 | tail -n 1
In the ls command, -1 lists just the file names, one per line, and -t sorts them by modification time, newest first. Piping through head takes the first 14 lines of the ls output, and tail -n 1 then takes the last of those. This should give you the file that is the 14th newest.
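Building on the ls -1t idea, the following sketch deletes instead of printing: tail -n +15 selects everything after the 14 newest entries. The function name prune_keep_newest is made up for this illustration. It assumes the directory holds only backup files and that no file name contains a newline.

```shell
#!/bin/bash
# Delete all but the $2 newest entries in directory $1.
# Assumption: file names contain no newlines.
prune_keep_newest() {
    local dir=$1 keep=$2
    # newest first; skip the first $keep lines and process the rest
    ls -1t -- "$dir" | tail -n +"$((keep + 1))" | while IFS= read -r name; do
        # ls printed bare names, so prepend the directory before removing;
        # subdirectories are left alone
        if [ -f "$dir/$name" ]; then
            rm -- "$dir/$name"
        fi
    done
}
```

Called as prune_keep_newest /backup 14 after each backup, this keeps the count at 14 no matter how many days were skipped.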
Here is another suggestion. The following script simply enumerates the backups. This eases the task of keeping track of the last n backups. If you need to know the actual creation date you can simply check the file metadata, e.g. using stat.
#!/bin/sh
set -e
backup_source='somedir'
backup_destination='backup'
retain=14
filename="backup-$retain.tgz"
check_fileserver() {
nc -z -w 5 file.server.net 80 2>/dev/null || exit 1
}
backup_advance() {
if [ -f "$backup_destination/$filename" ]; then
echo "removing $filename"
rm "$backup_destination/$filename"
fi
for i in $(seq $(($retain)) -1 2); do
file_to="backup-$i.tgz"
file_from="backup-$(($i - 1)).tgz"
if [ -f "$backup_destination/$file_from" ]; then
echo "moving $backup_destination/$file_from to $backup_destination/$file_to"
mv "$backup_destination/$file_from" "$backup_destination/$file_to"
fi
done
}
do_backup() {
tar czf "$backup_destination/backup-1.tgz" "$backup_source"
}
check_fileserver
backup_advance
do_backup
exit 0

grep from tar.gz without extracting [faster one]

I am trying to grep for a pattern in a dozen .tar.gz files, but it's very slow.
I am using:
tar -ztf file.tar.gz | while read FILENAME
do
if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
then
echo "$FILENAME contains string"
fi
done
If you have zgrep you can use
zgrep -a string file.tar.gz
You can use the --to-command option to pipe files to an arbitrary script. Using this you can process the archive in a single pass (and without a temporary file). See also this question, and the manual.
Armed with the above information, you could try something like:
$ tar xf file.tar.gz --to-command "awk '/bar/ { print ENVIRON[\"TAR_FILENAME\"]; exit }'"
bfe2/.bferc
bfe2/CHANGELOG
bfe2/README.bferc
I know this question is 4 years old, but I have a couple different options:
Option 1: Using tar --to-command grep
The following line will look in example.tgz for PATTERN. This is similar to @Jester's example, but I couldn't get his pattern matching to work.
tar xzf example.tgz --to-command 'grep --label="$TAR_FILENAME" -H PATTERN ; true'
Option 2: Using tar -tzf
The second option is using tar -tzf to list the files, then go through them with grep. You can create a function to use it over and over:
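targrep () {
    # read member names line by line so names with spaces stay intact
    tar -tzf "$1" | while IFS= read -r i; do
        # grep prints nothing for members that do not match
        tar -Oxzf "$1" "$i" | grep --label="$i" -H "$2"
    done
}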
Usage:
targrep example.tar.gz "pattern"
Both the below options work well.
$ zgrep -ai 'CDF_FEED' FeedService.log.1.05-31-2019-150003.tar.gz | more
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
$ zcat FeedService.log.1.05-31-2019-150003.tar.gz | grep -ai 'CDF_FEED'
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
If this is really slow, I suspect you're dealing with a large archive file. It's going to uncompress it once to extract the file list, and then uncompress it N times--where N is the number of files in the archive--for the grep. In addition to all the uncompressing, it's going to have to scan a fair bit into the archive each time to extract each file. One of tar's biggest drawbacks is that there is no table of contents at the beginning. There's no efficient way to get information about all the files in the archive and only read that portion of the file. It essentially has to read all of the file up to the thing you're extracting every time; it can't just jump to a filename's location right away.
The easiest thing you can do to speed this up would be to uncompress the file first (gunzip file.tar.gz) and then work on the .tar file. That might help enough by itself. It's still going to loop through the entire archive N times, though.
If you really want this to be efficient, your only option is to completely extract everything in the archive before processing it. Since your problem is speed, I suspect this is a giant file that you don't want to extract first, but if you can, this will speed things up a lot:
tar zxf file.tar.gz
for f in hopefullySomeSubdir/*; do
grep -l "string" "$f"
done
Note that grep -l prints the name of any matching file, quits after the first match, and is silent if there's no match. That alone will speed up the grepping portion of your command, so even if you don't have the space to extract the entire archive, grep -l will help. If the files are huge, it will help a lot.
For starters, you could start more than one process:
tar -ztf file.tar.gz | while read FILENAME
do
(if tar -zxf file.tar.gz "$FILENAME" -O | grep -l "string"
then
echo "$FILENAME contains string"
fi) &
done
The ( ... ) & creates a new detached (read: the parent shell does not wait for the child)
process.
After that, you should optimize the extracting of your archive. The read is no problem,
as the OS should have cached the file access already. However, tar needs to unpack
the archive every time the loop runs, which can be slow. Unpacking the archive once
and iterating over the result may help here:
tempPath=$(mktemp -d) &&
tar -zxf file.tar.gz -C "$tempPath" &&
find "$tempPath" -type f | {
    while read -r FILENAME
    do
        (if grep -l "string" "$FILENAME"
        then
            echo "$FILENAME contains string"
        fi) &
    done
    wait    # let the background greps finish before cleaning up
}
rm -r "$tempPath"
find is used here, to get a list of files in the target directory of tar, which we're iterating over, for each file searching for a string.
Edit: Use grep -l to speed up things, as Jim pointed out. From man grep:
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would
normally have been printed. The scanning will stop on the first match. (-l is specified
by POSIX.)
That's actually very easy with ugrep option -z:
-z, --decompress
Decompress files to search, when compressed. Archives (.cpio,
.pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
.tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
matching pathnames of files in archives are output in braces. If
-g, -O, -M, or -t is specified, searches files within archives
whose name matches globs, matches file name extensions, matches
file signature magic bytes, or matches file types, respectively.
Supported compression formats: gzip (.gz), compress (.Z), zip,
bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).
Which requires just one command to search file.tar.gz as follows:
ugrep -z "string" file.tar.gz
This greps each of the archived files to display matches. Archived filenames are shown in braces to distinguish them from ordinary filenames. For example:
$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}: { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello
If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:
$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt
All of the code above was really helpful, but none of it quite answered my own need: grep all *.tar.gz files in the current directory to find a pattern that is specified as an argument in a reusable script to output:
The name of both the archive file and the extracted file
The line number where the pattern was found
The contents of the matching line
It's what I was really hoping that zgrep could do for me and it just can't.
Here's my solution:
pattern=$1
for f in *.tar.gz; do
echo "$f:"
tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true";
done
You can also replace the tar line with the following if you'd like to test that all variables are expanding properly with a basic echo statement:
tar -xzf "$f" --to-command 'echo "f:`basename $TAR_FILENAME` s:'"$pattern\""
Let me explain what's going on. Hopefully, the for loop and the echo of the archive filename in question is obvious.
tar -xzf: x extract, z filter through gzip, f based on the following archive file...
"$f": The archive file provided by the for loop (such as what you'd get by doing an ls) in double-quotes to allow the variable to expand and ensure that the script is not broken by any file names with spaces, etc.
--to-command: Pass the output of the tar command to another command rather than actually extracting files to the filesystem. Everything after this specifies what the command is (grep) and what arguments we're passing to that command.
Let's break that part down by itself, since it's the "secret sauce" here.
'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
First, we use a single-quote to start this chunk so that the executed sub-command (basename $TAR_FILENAME) is not immediately expanded/resolved. More on that in a moment.
grep: The command to be run on the (not actually) extracted files
--label=: The label to prepend the results, the value of which is enclosed in double-quotes because we do want to have the grep command resolve the $TAR_FILENAME environment variable passed in by the tar command.
basename $TAR_FILENAME: Runs as a command (surrounded by backticks) and removes directory path and outputs only the name of the file
-Hin: H Display filename (provided by the label), i Case insensitive search, n Display line number of match
Then we "end" the first part of the command string with a single quote and start up the next part with a double quote so that the $pattern, passed in as the first argument, can be resolved.
Realizing which quotes I needed to use where was the part that tripped me up the longest. Hopefully, this all makes sense to you and helps someone else out. Also, I hope I can find this in a year when I need it again (and I've forgotten about the script I made for it already!)
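To see what the quoting produces, it can help to print the assembled string instead of handing it to tar. This is only a sketch; the function name build_to_command is made up for illustration.

```shell
#!/bin/bash
# Build the string handed to tar --to-command for a given search pattern.
# The single-quoted part is kept literally; only "$1" is expanded here.
build_to_command() {
    printf '%s\n' 'grep --label="`basename $TAR_FILENAME`" -Hin '"$1 ; true"
}

build_to_command ford
# prints: grep --label="`basename $TAR_FILENAME`" -Hin ford ; true
```

The backticks and $TAR_FILENAME survive untouched and are only resolved later, when the command is run for each archive member. Because the pattern is pasted in unquoted, a pattern containing spaces or shell metacharacters would be split or interpreted at that point.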
It's been a couple of weeks since I wrote the above and it's still super useful, but it wasn't quite good enough: files have piled up and searching has gotten messier. I needed a way to limit what I looked at by the date of the file (only looking at more recent files). So here's that code. Hopefully it's fairly self-explanatory.
if [ -z "$1" ]; then
    echo "Look within all tar.gz files for a string pattern, optionally only in recent files"
    echo "Usage: targrep <string to search for> [start date]"
    exit 1    # nothing to search for, so stop here
fi
pattern=$1
startdatein=$2
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
filedate=$(date -r "$f" +%s)
if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
echo "$f:"
tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
fi
done
And I can't stop tweaking this thing. I added an argument to filter by the name of the output files in the tar file. Wildcards work, too.
Usage:
targrep.sh [-d <start date>] [-f <filename to include>] <string to search for>
Example:
targrep.sh -d "1/1/2019" -f "*vehicle_models.csv" ford
while getopts "d:f:" opt; do
case $opt in
d) startdatein=$OPTARG;;
f) targetfile=$OPTARG;;
esac
done
shift "$((OPTIND-1))" # Discard options and bring forward remaining arguments
pattern=$1
echo "Searching for: $pattern"
if [[ -n $targetfile ]]; then
echo "in filenames: $targetfile"
fi
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
filedate=$(date -r "$f" +%s)
if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
echo "$f:"
if [[ -z "$targetfile" ]]; then
tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
else
tar -xzf "$f" --no-anchored "$targetfile" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
fi
fi
done
zgrep works fine for me, but only if all the files inside are plain text.
It looks like nothing works if the tgz file contains gzipped files.
You can mount the TAR archive with ratarmount and then simply search for the pattern in the mounted view:
pip install --user ratarmount
ratarmount large-archive.tar mountpoint
grep -r '<pattern>' mountpoint/
This is much faster than iterating over each file and piping it to grep separately, especially for compressed TARs. Here are benchmark results in seconds for a 55 MiB uncompressed and 42 MiB compressed TAR archive containing 40 files:
Compression   Ratarmount      Bash loop over tar -O
none          0.31 +- 0.01    0.55 +- 0.02
gzip          1.1 +- 0.1      13.5 +- 0.1
bzip2         1.2 +- 0.1      97.8 +- 0.2
Of course, these results are highly dependent on the archive size and how many files the archive contains. These test examples are pretty small because I didn't want to wait too long. But, they already exemplify the problem well enough. The more files there are, the longer it takes for tar -O to jump to the correct file. And for compressed archives, it will be quadratically slower the larger the archive size is because everything before the requested file has to be decompressed and each file is requested separately. Both of these problems are solved by ratarmount.
This is the code for benchmarking:
function checkFilesWithRatarmount()
{
local pattern=$1
local archive=$2
ratarmount "$archive" "$archive.mountpoint"
'grep' -r -l "$pattern" "$archive.mountpoint/"
}
function checkEachFileViaStdOut()
{
local pattern=$1
local archive=$2
tar --list --file "$archive" | while read -r file; do
if tar -x --file "$archive" -O -- "$file" | grep -q "$pattern"; then
echo "Found pattern in: $file"
fi
done
}
function createSampleTar()
{
for i in $( seq 40 ); do
head -c $(( 1024 * 1024 )) /dev/urandom | base64 > $i.dat
done
tar -czf "$1" [0-9]*.dat
}
createSampleTar myarchive.tar.gz
time checkEachFileViaStdOut ABCD myarchive.tar.gz
time checkFilesWithRatarmount ABCD myarchive.tar.gz
sleep 0.5s
fusermount -u myarchive.tar.gz.mountpoint
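The quadratic growth described above can be illustrated with a little arithmetic. This is a simplified model, not a measurement: it assumes all archive members are the same size and that extracting member i only requires decompressing members 1 through i. The function names are made up for this sketch.

```shell
#!/bin/bash
# Units of data decompressed by the per-file loop, counting one unit per
# archive member: member i needs members 1..i decompressed, so the total
# is 1 + 2 + ... + n = n(n+1)/2.
loop_units() {
    local n=$1
    echo $(( n * (n + 1) / 2 ))
}

# A single pass over the archive decompresses each member once.
single_pass_units() {
    echo "$1"
}

echo "loop: $(loop_units 40), single pass: $(single_pass_units 40)"
# prints: loop: 820, single pass: 40
```

Under these assumptions, doubling the number of files roughly quadruples the work of the loop but only doubles the work of a single pass.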
In my case the tarballs have a lot of tiny files and I want to know what archived file inside the tarball matches. zgrep is fast (less than one second) but doesn't provide the info I want, and tar --to-command grep is much, much slower (many minutes)1.
So I went the other direction and had zgrep tell me the byte offsets of the matches in the tarball and put that together with the list of offsets in the tarball of all archived files to find the matching archived files.
#!/bin/bash
set -e
set -o pipefail
function tar_offsets() {
# Get the byte offsets of all the files in a given tarball
# based on https://stackoverflow.com/a/49865044/60422
[ $# -eq 1 ]
tar -tvf "$1" -R | awk '
BEGIN{
getline;
f=$8;
s=$5;
}
{
offset = int($2) * 512 - and((s+511), compl(512)+1)
print offset,s,f;
f=$8;
s=$5;
}'
}
function tar_byte_offsets_to_files() {
[ $# -eq 1 ]
# Convert the search results of a tarball with byte offsets
# to search results with archived file name and offset, using
# the provided tar_offsets output (single pass, suitable for
# process substitution)
offsets_file="$1"
prev_offset=0
prev_offset_filename=""
IFS=' ' read -r last_offset last_len last_offset_filename < "$offsets_file"
while IFS=':' read -r search_result_offset match_text
do
while [ $last_offset -lt $search_result_offset ]; do
prev_offset=$last_offset
prev_offset_filename="$last_offset_filename"
IFS=' ' read -r last_offset last_len last_offset_filename < "$offsets_file"
# offsets increasing safeguard
[ $prev_offset -le $last_offset ]
done
# now last offset is the first file strictly after search result offset so prev offset is
# the one at or before it, and must be the one it is in
result_file_offset=$(( $search_result_offset - $prev_offset ))
echo "$prev_offset_filename:$result_file_offset:$match_text"
done
}
# Putting it together e.g.
zgrep -a --byte-offset "your search here" some.tgz | tar_byte_offsets_to_files <(tar_offsets some.tgz)
1 I'm running this in Git for Windows' minimal MSYS2 fork unixy environment, so it's possible that the launch overhead of grep is much much higher than on any kind of real Unix machine and would make `tar --to-command grep` good enough there; benchmark solutions for your own needs and platform situation before selecting.

Shell script to delete files when disk is full

I am writing a small little script to clear space on my linux everyday via CRON if the cache directory grows too large.
Since I am really green at bash scripting, I will need a little bit of help from you linux gurus out there.
Here is basically the logic (pseudo-code)
if ( Drive Space Left < 5GB )
{
change directory to '/home/user/lotsa_cache_files/'
if ( current working directory = '/home/user/lotsa_cache_files/')
{
delete files in /home/user/lotsa_cache_files/
}
}
Getting drive space left
I plan to get the drive space left from the 'df /dev/sda5' command.
It returns the following output, for your info:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda5 225981844 202987200 11330252 95% /
So a little regex might be necessary to get the '11330252' out of the returned value
A little paranoia
The 'if ( current working directory = /home/user/lotsa_cache_files/)' part is just a defensive mechanism for the paranoia within me. I wanna make sure that I am indeed in '/home/user/lotsa_cache_files/' before I proceed with the delete command which is potentially destructive if the current working directory is not present for some reason.
Deleting files
The deletion of files will be done with the command below instead of the usual rm -f:
find . -name "*" -print | xargs rm
This is due to the inherent inability of linux systems to 'rm' a directory if it contains too many files, as I have learnt in the past.
Just another proposal (comments within code):
FILESYSTEM=/dev/sda1 # or whatever filesystem to monitor
CAPACITY=95 # delete if FS is over 95% of usage
CACHEDIR=/home/user/lotsa_cache_files/
# Proceed if filesystem capacity is over than the value of CAPACITY (using df POSIX syntax)
# using [ instead of [[ for better error handling.
if [ $(df -P $FILESYSTEM | awk '{ gsub("%",""); capacity = $5 }; END { print capacity }') -gt $CAPACITY ]
then
# let's do some safe removal (if $CACHEDIR is empty or is not a directory, find will exit
# with an error, which is quite safe for misruns):
find "$CACHEDIR" -maxdepth 1 -type f -exec rm -f {} \;
# for a recursive removal of files and dirs, drop -maxdepth and -type and use this instead:
# find "$CACHEDIR" -mindepth 1 -delete
fi
Call the script from crontab to do scheduled cleanings
I would do it this way:
# get the available space left on the device
size=$(df -k /dev/sda5 | tail -1 | awk '{print $4}')
# check if the available space is smaller than 5GB (5000000kB)
if (($size<5000000)); then
# find all files under /home/user/lotsa_cache_files and delete them
find /home/user/lotsa_cache_files -name "*" -delete
fi
Here's the script I use to delete old files in a directory to free up space...
#!/bin/bash
#
# prune_dir - prune directory by deleting files if we are low on space
#
DIR=$1
CAPACITY_LIMIT=$2
if [ "$DIR" == "" ]
then
echo "ERROR: directory not specified"
exit 1
fi
if ! cd "$DIR"
then
echo "ERROR: unable to chdir to directory '$DIR'"
exit 2
fi
if [ "$CAPACITY_LIMIT" == "" ]
then
CAPACITY_LIMIT=95 # default limit
fi
CAPACITY=$(df -k . | awk '{gsub("%",""); capacity=$5}; END {print capacity}')
if [ $CAPACITY -gt $CAPACITY_LIMIT ]
then
#
# Get list of files, oldest first.
# Delete the oldest files until
# we are below the limit. Just
# delete regular files, ignore directories.
#
ls -rt | while read -r FILE
do
if [ -f "$FILE" ]
then
if rm -f "$FILE"
then
echo "Deleted $FILE"
CAPACITY=$(df -k . | awk '{gsub("%",""); capacity=$5}; END {print capacity}')
if [ $CAPACITY -le $CAPACITY_LIMIT ]
then
# we're below the limit, so stop deleting
exit
fi
fi
fi
done
fi
To detect how full a filesystem is, I use this:
df -k $FILESYSTEM | tail -1 | awk '{print $5}'
That gives me the usage percentage of the filesystem directly, so I don't need to compute it :)
If you use bash, you can use pushd/popd to change directory and be sure you are actually in it.
pushd '/home/user/lotsa_cache_files/'
do the stuff
popd
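The pushd idea only protects you if the script stops when pushd fails. Here is a minimal sketch of that check; the function name clean_cache_dir is made up for illustration, and it removes only regular files directly inside the directory.

```shell
#!/bin/bash
# Remove regular files directly inside "$1", but only after pushd succeeded.
clean_cache_dir() {
    local dir=$1
    # Refuse an empty argument so we never act on the current directory.
    [ -n "$dir" ] || return 1
    # If the directory is missing, pushd fails and nothing is deleted.
    pushd "$dir" > /dev/null 2>&1 || return 1
    find . -mindepth 1 -maxdepth 1 -type f -exec rm -f -- {} +
    popd > /dev/null
}
```

Calling clean_cache_dir /home/user/lotsa_cache_files/ from cron then either cleans that directory or does nothing at all, which is the defensive behaviour the question asks for.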
Here's what I do:
while IFS= read -r f; do rm -rf "${f}"; done < movies-to-delete.txt

Resources