Clearing archive files with a Linux bash script

Here is my problem:
I have a folder that stores multiple files with a specific format:
Name_of_file.TypeMM-DD-YYYY-HH:MM
where MM-DD-YYYY-HH:MM is the time of its creation. There can be multiple files with the same name, but of course not with the same time.
What I want is a script that keeps the 3 newest versions of each file.
So, I found one example there:
Deleting oldest files with shell
But I don't want to delete a fixed number of files; I want to keep a certain number of the newest ones. Is there a way to make that find command parse out the Name_of_file part and keep only the 3 newest?
Here is the code I've tried yet, but it's not exactly what I need.
find /the/folder -type f -name 'Name_of_file.Type*' -mtime +3 -delete
Thanks for help!
So I decided to add my final solution in case anyone would like it. It's a combination of the 2 solutions given.
ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}" | awk 'NR > 3' | xargs rm
One line, super efficient. If the date or name pattern ever changes, just adjust the grep -P pattern to match it. That way you can be sure that only files fitting this pattern get deleted.
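Note that this one-liner keeps the 3 newest files overall, not 3 per base name. If the per-name behaviour is needed, here is a rough sketch (it assumes the date suffix is always the 16-character YYYY-MM-DD-HH:MM form and that GNU xargs is available):
# keep 3 newest per base name: strip the 16-char date suffix to group,
# then delete everything past the 3rd entry of each group
ls -r | grep -P '\d{4}-\d{2}-\d{2}-\d{2}:\d{2}$' \
    | awk '{ base = substr($0, 1, length($0) - 16); if (++seen[base] > 3) print }' \
    | xargs -d '\n' -r rm --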

Can you be extra, extra sure that the timestamp on the file is the exact same timestamp on the file name? If they're off a bit, do you care?
The ls command can sort files by timestamp order. You could do something like this:
$ ls -t | awk 'NR > 3' | xargs rm
The ls -t lists the files by modification time, newest first.
The awk 'NR > 3' prints the list of files except for the first three lines, which are the three newest.
The xargs rm will remove the files that are older than those first three.
Now, this isn't the exact solution. There are possible problems with xargs because file names might contain weird characters or whitespace. If you can guarantee that's not the case, this should be okay.
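If you do need to be robust against such names, here is a rough NUL-delimited sketch (it assumes GNU find, sort, tail, cut, and xargs; the -name pattern is the one from the question):
# list files newest-first by mtime, drop the first three, delete the rest
find . -maxdepth 1 -type f -name 'Name_of_file.Type*' -printf '%T@ %p\0' \
    | sort -z -rn \
    | tail -z -n +4 \
    | cut -z -d' ' -f2- \
    | xargs -0 -r rm --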
Also, you probably want to group the files by name, and keep the last three. Hmm...
ls | sed -E 's/[0-9]{2}-[0-9]{2}-[0-9]{4}-[0-9]{2}:[0-9]{2}$//' | sort -u | while read -r file
do
    ls -t "$file"* | awk 'NR > 3' | xargs rm
done
The ls will list all of the files in the directory. The sed will remove the date-time stamp from the file names. The sort -u will make sure you only have the unique file names. Thus
file1.txt-01-12-1950
file2.txt-02-12-1978
file2.txt-03-12-1991
Will be reduced to just:
file1.txt
file2.txt
These are fed through the loop, and the ls -t "$file"* lists all of the files that start with that file name and suffix, newest first; awk strips the newest three from the list, and xargs rm deletes the rest, leaving only the newest three.

Assuming we're using the date in the filename to date the archive file, and that it is possible to change the date format to YYYY-MM-DD-HH:MM (as established in comments above), here's a quick and dirty shell script to keep the newest 3 versions of each file within the present working directory:
#!/bin/bash
KEEP=3  # number of versions to keep
while read FNAME; do
    NODATE=${FNAME:0:-16}  # get filename without the date (remove last 16 chars)
    if [ "$NODATE" != "$LASTSEEN" ]; then  # new file found
        FOUND=1; LASTSEEN="$NODATE"
    else  # same file, different date
        let FOUND="FOUND + 1"
        if [ $FOUND -gt $KEEP ]; then
            echo "- Deleting older file: $FNAME"
            rm "$FNAME"
        fi
    fi
done < <(\ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}")
Example run:
[me@home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2011-12-12-12:11
some_file.exe2012-01-11-23:11
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
[me@home]$ ./delete_old.sh
- Deleting older file: some_file.exe2012-01-11-23:11
- Deleting older file: some_file.exe2011-12-12-12:11
[me@home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
Essentially, by changing the date in the file name to the form YYYY-MM-DD-HH:MM, a normal string sort (such as the one done by ls) will automatically group similar files together, sorted by date-time.
The ls -r on the last line simply lists all files within the current working directory and prints the results in reverse order, so newer archive files appear first.
We pass the output through grep to extract only files that are in the correct format.
The output of that command combination is then looped through (see the while loop) and we can simply start deleting after 3 occurrences of the same filename (minus the date portion).

This pipeline will get you the 3 newest files (by modification time) in the current dir
stat -c $'%Y\t%n' file* | sort -n | tail -3 | cut -f 2-
To get all but the 3 newest:
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2-
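If the goal is deletion rather than listing, that second pipeline can feed xargs directly; a sketch assuming GNU xargs and file names without embedded newlines:
# delete everything except the 3 most recently modified "file*" entries
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2- | xargs -d '\n' -r rm --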

Related

Finding the oldest folder in a directory in Linux, even when files inside are modified

I have two folders, A and B, each containing two files,
which are created in the following order:
mkdir A
cd A
touch a_1
touch a_2
cd ..
mkdir B
cd B
touch b_1
touch b_2
cd ..
From the above I need to find which folder was created first (not modified).
ls -c <path_to_root_before_A_and_B> | tail -1
This outputs "A" (no issues here).
Now I delete the file a_1 inside directory A.
Then I execute the command again:
ls -c <path_to_root_before_A_and_B> | tail -1
This time it shows "B".
But directory A still contains the file a_2, yet the ls command shows "B". How can I overcome this?
How To Get File Creation Date Time In Bash-Debian
You'll want to read the link above for that. Files and directories store the same kinds of modification times, which means directories do not record their creation date. Methods like the ls -i one mentioned earlier may work sometimes, but when I ran it just now it mixed really old files up with really new files, so I don't think it works the way you expect.
Instead try touching a file immediately after creating a directory, save it as something like .DIRBIRTH and make it hidden. Then when trying to find the order the directories were made, just grep for which .DIRBIRTH has the oldest modification date.
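A minimal sketch of that marker idea (the .DIRBIRTH name is just the convention suggested above; GNU stat assumed):
# when creating a directory, drop a hidden marker whose mtime records its birth
mkdir A && touch A/.DIRBIRTH
# later, the oldest directory is the one whose marker has the oldest mtime;
# the directory name is the part before /.DIRBIRTH
stat -c '%Y %n' */.DIRBIRTH | sort -n | head -n 1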
Assuming that all the stars align (you're using a version of GNU stat(1) that supports the file birth time formats, you're using a filesystem that records them, and your Linux kernel is new enough to support the statx(2) syscall), this script should print out all immediate subdirectories of the directory passed as its argument, sorted by creation time:
#!/bin/sh
rootdir=$1
find "$rootdir" -maxdepth 1 -type d -exec stat -c "%W %n" {} + | tail -n +2 \
| sort -k1,1n | cut --complement -d' ' -f1

How to move files in Linux, based on their names, into folders with corresponding names?

I need to move a series of files into certain folders via a script. The files are of the format xxxx.date.0000 and I have to move each one into a folder whose name matches the date value in the file name.
For example:
file hello.20190131.0000
in folder 20190131
Ideally the script would also create the folders before moving the files, but that is not a priority because I can create them by hand. I managed to print the date values to the screen with
ls *.0000 | awk -F. '{print $2}'
Does anyone have any suggestions on how to proceed?
The initial awk command provided much of the answer. You just need to do something with the directory name you extract:
A simple option:
ls *.0000 | awk -F. '{printf "mkdir -p '%s'; mv '%s' '%s';",$2,$0,$2}' | sh
This might be more efficient with a large number of files:
ls *.0000 | awk -F. '{print $2}' |\
sort | uniq |\
while read dir; do
mkdir -p "$dir"
mv *."$dir".0000 "$dir"
done
I would do something like this:
ls *.0000 |\
sort |\
while read f; do
    foldername="`echo $f | cut -d. -f2`"
    echo mkdir -p "$foldername/"    # drop the leading "echo"s to actually run the commands
    echo mv "$f" "$foldername/"
done
i.e., for each of your files, I build the folder name using the cut command with a dot as the field separator, taking the second field (the date in this case); then I create that folder with mkdir -p (the -p flag avoids any error if the folder already exists), and finally I move the file into the brand-new folder.
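A variant of the same idea that skips parsing ls output entirely, sketched in bash under the assumption that the date is always the second dot-separated field (as in hello.20190131.0000):
for f in *.0000; do
    [ -e "$f" ] || continue   # the glob matched nothing
    dir=${f#*.}               # strip up to the first dot
    dir=${dir%%.*}            # keep only the date field
    mkdir -p "$dir" && mv -- "$f" "$dir/"
done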
You can do that with rename, a.k.a. Perl rename.
Try it on a COPY of your files in a temporary directory.
If you use -p parameter, it will make any necessary directories for you automatically. If you use --dry-run parameter, you can see what it would do without actually doing anything.
rename --dry-run -p 'my @X=split /\./; $_=$X[1] . "/" . $_' hello*
Sample Output
'hello.20190131.0000' would be renamed to '20190131/hello.20190131.0000'
'hello.20190137.0000' would be renamed to '20190137/hello.20190137.0000'
All you need to know is that it passes you the current name of the file in a variable called $_ and it expects you to change that to return the new filename you would like.
So, I split the current name into the elements of an array @X, using the dot (period) as the separator:
my @X = split /\./
That gives me the output directory in $X[1]. Now I can set the new filename I want by putting the new directory, a slash and the old filename into $_:
$_=$X[1] . "/" . $_
You could also try this shorter version:
rename --dry-run -p 's/.*\.(\d+)\..*/$1\/$_/' hello*
On Arch Linux, the package you would use is called perl-rename.
On Debian, it is called rename.
On macOS, install it with Homebrew: brew install rename

List files using ls that meet a condition

I am writing a batch program to delete all files in a directory whose names meet a condition.
The directory holds a large number of text files (hundreds of thousands) whose names are fixed as "abc" + date:
abc_20180820.txt
abc_20180821.txt
abc_20180822.txt
abc_20180823.txt
abc_20180824.txt
The program greps all the files, compares each file's date to a fixed date, and deletes the file if its date is less than the fixed date.
The problem is that it takes very long to handle that many files (~1 hour to delete 300k files).
My question: is there a way to compare the date while running the ls command? That is, not get all files into a list and then compare and delete, but list only the files that already meet the condition and then delete them. I think that would perform better.
My code is
TARGET_DATE = "5-12"
DEL_DATE = "20180823"
ls -t | grep "[0-9]\{8\}".txt\$ > ${LIST}
for EACH_FILE in `cat ${LIST}` ;
do
DATE=`echo ${EACH_FILE} | cut -c${TARGET_DATE }`
COMPARE=`expr "${DATE}" \< "${DEL_DATE}"`
if [ $COMPARE -eq 1 ] ;
then
rm -f ${EACH_FILE}
fi
done
I found a similar problem but I don't know how to make it work:
List file using ls with a condition and process/grep files that only whitespaces
Here is a refactoring which gets rid of the pesky ls. Looping over a large directory is still going to be somewhat slow.
# Use lowercase for private variables
# to avoid clobbering a reserved system variable
# You can't have spaces around the equals sign
del_date="20180823"
# No need for ls here
# No need for a temporary file
for filename in *[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].txt
do
    # Avoid external process; use the shell's parameter substitution
    date=${filename%.txt}
    # This could fail if the file name contains literal shell metacharacters!
    date=${date#${date%????????}}  # keep only the last 8 characters (the date)
    # Avoid expr
    if [ "$date" -lt "$del_date" ]; then
        # Just print the file name, null-terminated for xargs
        printf '%s\0' "$filename"
    fi
done |
# For efficiency, do batch delete
xargs -r0 rm
The wildcard expansion will still take a fair amount of time because the shell will sort the list of filenames. A better solution is probably to refactor this into a find command which avoids the sorting.
find . -maxdepth 1 -type f \( \
    -name '*1[89][0-9][0-9][0-9][0-9][0-9][0-9].txt' \
    -o -name '*201[0-7][0-9][0-9][0-9][0-9].txt' \
    -o -name '*20180[1-7][0-9][0-9].txt' \
    -o -name '*201808[01][0-9].txt' \
    -o -name '*2018082[0-2].txt' \
\) -delete
You could do something like:
rm *201[0-7]*.txt   # remove all files from 2010-2017
rm *20180[1-4]*.txt # remove all files from Jan-Apr 2018
# And so on
...
to remove a large number of files. Then your code would run faster.
Yes, it takes a lot of time if you have that many files in one folder.
It is a bad idea to keep so many files in one folder. Even a simple ls or find will hammer the storage, and if you have scripts that iterate over your files, you are definitely hammering the storage.
So after you have waited an hour for the cleanup, take the time to build a better folder structure. It is a good idea to sort files by year/month/day, possibly even hour,
e.g.
somefolder/2018/08/24/...files here
Then you can easily delete, move, or compress a whole month or year.
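For example, the existing abc_YYYYMMDD.txt archives could be filed away with a short bash sketch like this (the somefolder/YYYY/MM/DD layout is just the one suggested above):
# move abc_YYYYMMDD.txt into somefolder/YYYY/MM/DD/
for f in abc_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].txt; do
    d=${f#abc_}; d=${d%.txt}                  # YYYYMMDD
    dest="somefolder/${d:0:4}/${d:4:2}/${d:6:2}"
    mkdir -p "$dest" && mv -- "$f" "$dest/"
done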
I found a solution in this thread.
https://unix.stackexchange.com/questions/199554/get-files-with-a-name-containing-a-date-value-less-than-or-equal-to-a-given-inpu
The awk command is so powerful that it only takes ~1 minute to deal with hundreds of thousands of files (about 1/10 of the time the loop took).
ls | awk -v date="$DEL_DATE" '$0 <= date' | xargs rm -vrf
I can even count, copy, or move files with that same approach; it is the fastest answer I've seen.
COUNT="$(ls | awk -v date="${DEL_DATE}" '$0 <= date' | xargs rm -vrf | wc -l)"

Rm and Egrep -v combo

I want to remove all the logs except the current log and the log before that.
These log files are created every 20 minutes, so the file names look like:
abc_23_19_10_3341.log
abc_23_19_30_3342.log
abc_23_19_50_3241.log
abc_23_20_10_3421.log
where 23 is today's date (yesterday's date might appear as well),
19 is the hour (7 o'clock), and 10, 30, 50, 10 are the minutes.
In this case I want to keep abc_23_20_10_3421.log, which is the current log (the one currently being written), and abc_23_19_50_3241.log (the previous one),
and remove the rest.
I got it to work by creating a folder, putting the newest files in that folder, removing the other files, and then deleting the folder. But that's too convoluted...
I also tried this
files_nodelete=`ls -t | head -n 2 | tr '\n' '|'`
rm *.txt | egrep -v "$files_nodelete"
but it didn't work. However, if I put ls instead of rm, it works.
I am an amateur in Linux, so please suggest a simple idea or approach. I tried xargs rm but it didn't work either.
I also read about mtime, but it seems a bit complicated since I am new to Linux.
I am working on a Solaris system.
Try the logadm tool in Solaris; it might be the simplest way to rotate logs. If you just want to get things done, it will do it.
http://docs.oracle.com/cd/E23823_01/html/816-5166/logadm-1m.html
If you want a solution similar to yours (but working), try this:
ls abc*.log | sort | head -n-2 | xargs rm
ls abc*.log: lists all files matching the pattern abc*.log
sort: sorts this list lexicographically (by name), from the oldest to the newest logfile
head -n-2: returns all but the last two entries in the list (you can give -n a negative count)
xargs rm: composes the rm command from the entries on stdin
If there are two or fewer files in the directory, this command will return an error like
rm: missing operand
and will not delete any files.
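One hedged tweak: if your xargs is GNU xargs, the -r (--no-run-if-empty) option suppresses that error when there is nothing to delete:
ls abc*.log | sort | head -n -2 | xargs -r rm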
It is usually not a good idea to use ls to enumerate files. Some files can cause havoc (files with a newline or another weird character in their name are the usual examples).
Using shell globs, here is an interesting way: we count the files newer than the one we are about to remove!
pattern='abc*.log'
for i in $pattern ; do
    [ -f "$i" ] || break
    # determine if this is the most recent file in the current directory
    # [I add -maxdepth 1 to limit the find to only that directory, no subdirs]
    if [ $(find . -maxdepth 1 -name "$pattern" -type f -newer "$i" -print0 | tr -cd '\000' | tr '\000' '+' | wc -c) -gt 1 ]
    then
        # there are 2 files more recent than $i that match the pattern,
        # so we can delete $i
        echo rm "$i" # remove the echo only when you are 100% sure that you want to delete all those files!
    else
        echo "$i is one of the 2 most recent files matching '${pattern}', I keep it"
    fi
done
I only use the globbing mechanism to feed filenames to "find", and just use the terminating NUL of -print0 to count the output filenames (thus I have no problems with any special characters in those filenames; I just need to know how many files were output).
tr -cd '\000' keeps only the \000 characters, i.e. the terminating NUL characters output by -print0. Then I translate each \000 to a single + character and count them with wc -c. If I see 0, "$i" was the most recent file. If I see 1, "$i" was the one just a bit older (so the find sees only the most recent one). And if I see more than 1, it means the 2 files (matching the pattern) that we want to keep are newer than "$i", so we can delete "$i".
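To see that counting trick on its own, this sketch prints how many *.log files exist in the current directory, NUL-safe (GNU find/tr/wc assumed):
# each matching file contributes exactly one NUL byte; wc -c counts them
find . -maxdepth 1 -type f -name '*.log' -print0 | tr -cd '\000' | wc -c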
I'm sure someone will step in with a better one, but the idea could be reused, I guess...
Thanks guys for all the answers.
I found my answer:
files=`ls -t *.txt | head -n 2 | tr '\n' '|' | rev |cut -c 2- |rev`
rm `ls -t | egrep -v "$files"`
Thank you for the help

Remove all files of a certain type except for one type in linux terminal

On my computer running Ubuntu, I have a folder full of hundreds of files, all named "index.html.n", where n starts at one and counts upward. Some of those files are actual HTML files, some are image files (PNG and JPG), and some are ZIP files.
My goal is to permanently remove every single file except the zip archives. I assume it's some combination of rm and file, but I'm not sure of the exact syntax.
If it fits into your argument list and no filename contains a colon, a simple pipe into xargs should do:
file * | grep -vi zip | cut -d: -f1 | tr '\n' '\0' | xargs -0 rm
First find locates the matching files, then file reports their types. sed drops the Zip archives from the list and strips everything but the filenames from the output of file. Lastly, rm deletes what remains:
find -name 'index.html.[0-9]*' | \
    xargs file | \
    sed -n '/: Zip archive/!s/\([^:]*\):.*/\1/p' | \
    xargs rm
I would run:
for f in index.html.*
do
    file "$f" | grep -qi zip
    [ $? -ne 0 ] && rm -i "$f"
done
and remove the -i option once you feel confident enough.
Here's the approach I'd use; it's not entirely automated, but it's less error-prone than some other approaches.
file * > cleanup.sh
or
file index.html.* > cleanup.sh
This generates a list of all files (excluding dot files), or of all index.html.* files, in your current directory and writes the list to cleanup.sh.
Using your favorite text editor (mine happens to be vim), edit cleanup.sh:
Add #!/bin/sh as the first line
Delete all lines containing the string "Zip archive"
On each line, delete everything from the : to the end of the line (in vim, :%s/:.*$//)
Replace the beginning of each line with "rm" followed by a space
Exit your editor, updating the file.
chmod +x cleanup.sh
You should now have a shell script that will delete everything except zip files.
Carefully inspect the script before running it. Look out for typos, and for files whose names contain shell metacharacters. You might need to add quotation marks to the file names.
(Note that if you do this as a one-line shell command, you don't have the opportunity to inspect the list of files you're going to delete before you actually delete them.)
Once you're satisfied that your script is correct, run
./cleanup.sh
from your shell prompt.
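For illustration, the edited cleanup.sh might end up looking something like this (the file names below are purely hypothetical):
#!/bin/sh
# hypothetical result of the edits described above
rm index.html.1
rm index.html.2
rm index.html.5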
for i in index.html.*
do
    type=$(file "$i")
    if [[ ! $type =~ "Zip" ]]
    then
        rm "$i"
    fi
done
Change the rm to an ls for testing purposes.
