Pruning old backups in several steps - Linux

I am looking for a way to thin out old backups. The backups are run on a daily basis, and I want to increase the interval as the backups become older.
After a couple of days I'd like to remove the daily backups, leaving only the "Sunday" backup. After a couple of weeks, only the first backup of each month that is available should remain.
Since I am dealing with historic backups, I cannot just change the naming scheme.
I tried to use 'find' for it, but couldn't find the right options.
Anyone got anything that might help?

I know it is historical data, but you might prefer coming up with a naming scheme to make this problem easier. It might be far easier to tackle it in two passes: first, renaming the directories based on the date, then selecting the directories to keep in the future.
You could make a quick approximation, if all the directory dates in ls -l output look good enough:
ls -l | awk '{print "mv " $8 " " $6;}' > /tmp/runme
Look at /tmp/runme, and if it looks good, you can run it with sh /tmp/runme. (Check which columns hold the name and the date in your ls -l output first; the field numbers vary with ls version and locale.) You might wish to prune the entries or something like that, up to you.
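If parsing ls columns turns out to be fragile, here is a hedged sketch of the same idea built on GNU stat instead (assuming each backup's modification time still reflects the date it was taken; /tmp/runme is just an example path):
# emit "mv <name> <YYYY-MM-DD>" commands from each entry's mtime,
# then inspect /tmp/runme and run it with sh /tmp/runme as above
for entry in *; do
    echo "mv '$entry' '$(stat -c '%y' "$entry" | cut -d' ' -f1)'"
done > /tmp/runme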
If all the backups are stored in directories named, e.g:
2011-01-01/
2011-01-02/
2011-01-03/
...
2011-02-01/
2011-02-02/
...
2011-03-07/
then your problem would be reduced to computing the names to keep and delete. This problem is much easier to solve than searching through all your files and trying to select which ones to keep and delete based on when they were made. (See date "+%Y-%m-%d" output for a quick way to generate this sort of name.)
Once they are named conveniently, you can keep the first backup of every month with a script like this:
for y in `seq 2008 2010`
do for m in `seq -w 1 12`
do for d in `seq -w 2 31`
do echo "rm $y-$m-$d"
done
done
done
Save its output, inspect it :) and then run the output, similar to the rename script.
Once you've got the past backups under control, you can generate the 2010 dynamically from date --date="Last Year" "+%Y", and add other improvements so it handles "one a week" for the current month and maintains itself going forward.
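To make that policy concrete, here is a rough sketch rather than a finished script, assuming the directories are named YYYY-MM-DD and sit in the current directory; the 3-day and 14-day thresholds are my reading of "a couple of days" and "a couple of weeks", so adjust them to taste:
today=$(date +%s)
for dir in [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]; do
    [ -d "$dir" ] || continue
    age_days=$(( (today - $(date -d "$dir" +%s)) / 86400 ))
    dow=$(date -d "$dir" +%u)    # 1..7, 7 = Sunday
    dom=$(date -d "$dir" +%d)
    if [ "$age_days" -gt 14 ]; then
        # older than ~two weeks: keep only the first of each month
        [ "$dom" = "01" ] || echo "rm -r $dir"
    elif [ "$age_days" -gt 3 ]; then
        # older than ~a couple of days: keep only the Sunday backups
        [ "$dow" = "7" ] || echo "rm -r $dir"
    fi
done > /tmp/prune-list
As with the other snippets, inspect /tmp/prune-list before feeding it to sh.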

I've developed a solution for my similar needs on top of @ajreal's starting point. My backups are named like "backup-2015-06-01T01:00:01" (using date "+%Y-%m-%dT%H:%M:%S").
Two simple steps: touch the files to keep using a shell glob pattern for first-of-each-month, and use find and xargs to delete anything more than 30 days old.
cd "$BACKUPS_DIR" || exit 1
# touch backups from the first of each month
touch *-01T*
# delete backups more than 30 days old
echo "Deleting these backups:"
find -maxdepth 1 -mtime +30
find -maxdepth 1 -mtime +30 -print0 | xargs -0 rm -r

yup, for example
find -type f -mtime 30
details -
http://www.gnu.org/software/findutils/manual/html_mono/find.html#Age-Ranges

Related

Quickly list random set of files in directory in Linux

Question:
I am looking for a performant, concise way to list N randomly selected files in a Linux directory using only Bash. The files must be randomly selected from different subdirectories.
Why I'm asking:
In Linux, I often want to test a random selection of files in a directory for some property. The directories contain 1000's of files, so I only want to test a small number of them, but I want to take them from different subdirectories in the directory of interest.
The following returns the paths of 50 "randomly"-selected files:
find /dir/of/interest/ -type f | sort -R | head -n 50
The directory contains many files, and resides on a mounted file system with slow read times (accessed through ssh), so the command can take many minutes. I believe the issue is that the first find command finds every file (slow), and only then prints a random selection.
If you are using locate and updatedb updates regularly (daily is probably the default), you could:
$ locate /home/james/test | sort -R | head -5
/home/james/test/10kfiles/out_708.txt
/home/james/test/10kfiles/out_9637.txt
/home/james/test/compr/bar
/home/james/test/10kfiles/out_3788.txt
/home/james/test/test
How often do you need it? Do the work periodically in advance to have it quickly available when you need it.
Create a refreshList script.
#!/usr/bin/env bash
find /dir/of/interest/ -type f | sort -R | head -n 50 >/tmp/rand.list
mv -f /tmp/rand.list ~
Put it in your crontab.
0 7-20 * * 1-5 nice -25 ~/refreshList
Then you will always have a ~/rand.list that's under an hour old.
If you don't want to use cron and aren't too picky about how old it is, just write a function that refreshes the file after you use it every time.
randFiles() {
    cat ~/rand.list
    { find /dir/of/interest/ -type f |
          sort -R | head -n 50 >/tmp/rand.list
      mv -f /tmp/rand.list ~
    } &
}
If you can't run locate and the find command is too slow, is there any reason this has to be done in real time?
Would it be possible to use cron to dump the output of the find command into a file and then do the random pick out of there?
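For completeness, a hedged sketch of that cron idea (the paths and schedule are placeholders; shuf is part of GNU coreutils): cache the expensive find once an hour, then the random pick is instant.
# crontab entry (example): rebuild the cache at the top of every hour
# 0 * * * * find /dir/of/interest/ -type f > /tmp/all_files.list 2>/dev/null
# then, whenever you need 50 random files:
shuf -n 50 /tmp/all_files.list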

Bash command to archive files daily based on date added

I have a suite of scripts that involve downloading files from a remote server and then parsing them. Each night, I would like to create an archive of the files downloaded that day.
Some constraints are:
Downloading from a Windows server to an Ubuntu server.
Inability to delete files on the remote server.
Require the date added to the local directory, not the date the file was created.
I have deduplication running at the downloading stage; however, the check (using ncftp) involves comparing the remote and local directories. One strategy is to create a new folder each day, download files into it and then tar it sometime after midnight. A problem arises in that the first scheduled download of the new day will grab ALL files on the remote server, because the new local folder is empty.
Because of the constraints, I considered simply archiving files based on "date added" to a central folder. This works very well using a Mac because HFS+ stores extended metadata such as date created and date added. So I can combine a tar command with something like below:
mdls -name kMDItemFSName -name kMDItemDateAdded -raw *.xml | \
xargs -0 -I {} echo {} | \
sed 'N;s/\n/ /' | \
but there doesn't seem to be an analogue under linux (at least not with EXT4 that I am aware of).
I am open to any form of solution to get around doubling up files into a subsequent day. The end result should be an archives directory full of tar.gz files looking something like:
files_$(date +"%Y-%m-%d").tar.gz
Depending on the method that is used to backup the files, the modified or changed date should reflect the time it was copied - for example if you used cp -p to back them up, the modified date would not change but the changed date would reflect the time of copy.
You can get this information using the stat command:
stat <filename>
which will return the following (along with other file related info not shown):
Access: 2016-05-28 20:35:03.153214170 -0400
Modify: 2016-05-28 20:34:59.456122913 -0400
Change: 2016-05-29 01:39:52.070336376 -0400
This output is from a file that I copied using cp -p at the time shown as 'change'.
You can get just the change time by calling stat with a specified format:
stat -c '%z' <filename>
2016-05-29 01:39:56.037433640 -0400
or with capital Z for that time in seconds since epoch. You could combine that with the date command to pull out just the date (or use grep, etc)
date -d "`stat -c '%z' <filename>" -I
2016-05-29
The command find can be used to find files by time frame, in this case using the flags -cmin 'changed minutes', -mmin 'modified minutes', or unlikely, -amin 'accessed minutes'. The sequence of commands to get the minutes since midnight is a little ugly, but it works.
We have to pass find an argument of "minutes since a file was last changed" (or modified, if that criteria works). So first you have to calculate the minutes since midnight, then run find.
min_since_mid=$(echo $(( $(date +%s) - $(date -d "$(date -I) 0" +%s) )) / 60 | bc)
Unrolling that a bit:
$(date +%s) == seconds since epoch until 'now'
"(date -I) 0" == todays date in format "YYYY-MM-DD 0" with 0 indicating 0 seconds into the day
$(date -d "(date -I 0" +%s)) == seconds from epoch until today at midnight
Then we (effectively) echo ( $now - $midnight ) / 60 to bc to convert the results into minutes.
The find call is passed the minutes since midnight with a leading '-' indicating up to X minutes ago. A '+' would indicate X minutes or more ago.
find /path/to/base/folder -cmin -"$min_since_mid"
The actual answer
Finally to create a tgz archive of files in the given directory (and subdirectories) that have been changed since midnight today, use these two commands:
min_since_mid=$(echo $(( $(date +%s) - $(date -d "$(date -I) 0" +%s) )) / 60 | bc)
find /path/to/base/folder -cmin -"${min_since_mid:-0}" -exec tar czvf /path/to/new/tarball.tgz {} +
With -exec ... {} +, find hands the filenames to tar directly, so spaces in names cause no trouble. (If you pipe to xargs instead, combine find's -print0 with xargs -0 to get the same safety.)
The only thing I'm not sure on is whether you should use the changed time (-cmin), the modified time (-mmin) or the accessed time (-amin). Take a look at your backup files and see which field accurately reflects the date/time of the backup - I would think changed time, but I'm not certain.
Update: changed -"$min_since_mid" to -"${min_since_mid:-0}" so that if min_since_mid isn't set you won't error out with invalid argument - you just won't get any results. You could also surround the find with an if statement to block the call if that variable isn't set properly.
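Pulling the pieces together, here is a sketch of what the nightly cron job could look like, assuming GNU find and tar; /path/to/base/folder and the archives directory are placeholders. Using tar's --null -T - instead of -exec also avoids the corner case where find splits a very long file list across several tar invocations, each overwriting the previous archive:
#!/bin/bash
# archive everything changed since midnight into a dated tarball
min_since_mid=$(( ( $(date +%s) - $(date -d "$(date -I) 0" +%s) ) / 60 ))
if [ "$min_since_mid" -gt 0 ]; then
    find /path/to/base/folder -type f -cmin -"$min_since_mid" -print0 |
        tar czvf "archives/files_$(date +%Y-%m-%d).tar.gz" --null -T -
fi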

Remove logstash Index/s using a bash script

I am looking for a way to remove old Logstash indices using a script. My Logstash indices are named logstash-2016.02.29, logstash-2016.03.01, and so on. At the moment I use a Chrome extension called Sense to remove the indices, or I can also use curl: curl -XDELETE 'http://myIpAddress:9200/logstash-2016.02.29'
I would like to write a script that runs daily and removes Logstash indices older than 2 weeks from Elasticsearch. Is this possible, and if so, how can I do it using the date from the name of the index?
G
Just use the find command:
find . -name 'logstash-*' -mtime +14 -type f -delete
This searches in the current directory and below, for all files whose name starts with "logstash", that are older than 14 days, and then deletes them.
If the file times are totally unreliable, and you have to use the filenames, try something like this:
#!/bin/bash
testdate=$(date -d '14 days ago' '+%Y%m%d')
for f in ./logstash-[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]; do
    dt=$(basename "${f//.}")
    dt=${dt#logstash-}
    [ $dt -le $testdate ] && rm -f "$f"
done
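Since Elasticsearch indices are not necessarily single files on disk, a more direct take on the question is to work from the index names themselves. A hedged sketch, assuming the logstash-YYYY.MM.DD naming and the myIpAddress:9200 endpoint from the question and using the _cat/indices API; it could run daily from cron:
#!/bin/bash
cutoff=$(date -d '14 days ago' '+%Y%m%d')
for idx in $(curl -s 'http://myIpAddress:9200/_cat/indices/logstash-*?h=index'); do
    d=${idx#logstash-}                   # e.g. 2016.02.29
    d=${d//./}                           # -> 20160229
    [[ $d =~ ^[0-9]{8}$ ]] || continue   # skip names that don't match the pattern
    if [ "$d" -lt "$cutoff" ]; then
        echo "deleting $idx"
        curl -XDELETE "http://myIpAddress:9200/$idx"
    fi
done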

Linux: List file names, if last modified between a date interval

I have 2 variables, which contains dates like this: 2001.10.10
And I want to use ls with a filter that only lists files whose last modification time falls between the first and the second date.
The best solution I can think of involves creating temporary files with the boundary timestamps, and then using find:
touch -t YYYYMMDD0000 oldest_file
touch -t YYYYMMDD0000 newest_file
find -maxdepth 1 -newer oldest_file -and -not -newer newest_file
rm oldest_file newest_file
You can use find's -printf '%P\n' option if you want to strip off the leading ./ from all the filenames.
If creating temporary files isn't an option, you might consider writing a script to calculate and print the age of a file, such as described here, and then using that as a predicate.
It is not the simplest solution, but I just put it together for this case. :-)
ls -l --full-time|awk '{s=$6;gsub(/[-\.]/,"",s);if ((s>="'"$from_variable"'") && (s<="'"$to_variable"'")) {print $0}}';
The problem is that these simple command-line tools don't handle a date type, so we first convert the dates to plain integers by removing the separating "-" and "." characters (your dates use ".", mine use "-", so both get removed). That is what this part does:
gsub(/[-\.]/,"",s)
After the removal we can compare the values as integers. In this example, we compare them against $from_variable and $to_variable, so this lists the files modified between $from_variable and $to_variable.
Both from_variable and to_variable need to be shell variables in the form 20070707 (for 7 July 2007).

How to find files with similar filename and how many of them there are with awk

I was tasked to delete old backup files from our Linux database (all except for the newest 3). Since we have multiple kinds of backups, I have to leave at least 3 backup files for each backup type.
My script should group all files with similar (matched) names together and delete all except the last 3 files in each group (I assume the OS will sort those files for me, so the newest backups will also be the last ones).
The files are in the format project_name.000000-000000.svndmp.bz2 where 0 can be any arbitrary digit and project_name can be any arbitrary name. The first 6 digits are part of the name, while the last 6 digits describe the backup's version.
So far, my code looks like this:
for i in *.svndmp.bz2    # only check backup files
do
    nOfOccurences =      # Need to find out, how many files have the same name
    currentFile = 0
    for f in awk -F"[.-]" '{print $1,$2}' $i    # This doesn't work
    do
        if [nOfOccurences - $currentFile -gt 3]
        then
            break
        else
            rm $f
            currentFile++
        fi
    done
done
I'm aware, that my script may try to remove old versions of a backup 4 times before moving on to the next backup. I'm not looking for performance or efficiency (we don't have that many backups).
My code is a result of 4 hours of searching the net and I'm running out of good Google queries (and my boss is starting to wonder why I'm still not back to my usual tasks)
Can anybody give me inputs, as to how I can solve my problems?
Find nOfOccurences
Make awk find files that fit the pattern "$1.$2-*"
Try this one, and see if it does what you want.
for project in `ls -1 *.svndmp.bz2 | awk -F'-' '{ print $1}' | uniq`; do
    files=`ls -1 ${project}*.svndmp.bz2 | sort`    # oldest backup first
    n_occur=`echo "$files" | wc -l`
    for f in $files; do
        if ((n_occur <= 3)); then    # keep the newest 3 of this group
            break
        fi
        echo "rm" $f
        ((--n_occur))
    done
done
If the output seems to be OK just replace the echo line.
Ah, and don't beat me if anything goes wrong. Use at your own risk only.
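A hedged variant of the same idea that restricts itself to the backup files and groups on the full prefix before the version number (it assumes GNU head's negative -n and file names without whitespace); again, replace echo with rm once the output looks right:
for prefix in $(ls -1 *.svndmp.bz2 | sed 's/-[0-9]\{6\}\.svndmp\.bz2$//' | sort -u); do
    # oldest versions first; list everything except the newest 3
    ls -1 "${prefix}"-*.svndmp.bz2 | sort | head -n -3 |
        while read -r f; do echo "rm $f"; done
done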

Resources