How to write a Linux shell script that removes files older than X days, but leaves the first file of the day by modification time?

As the title says, how could such a shell script be implemented? I know it's easy to find and delete files older than, e.g., 30 days using:
find /some_folder/ -name "file_prefix*" -mtime +30 -exec rm {} \;
But how do I add an exception so that the first file of each day, by modification time, is not removed?

Not the most elegant - but as a combination of a few answers - something like this will work:
d=2020-01-01
end_date=2020-02-03
while [ "$d" != "$end_date" ]; do
    d2=$(date -I -d "$d - 1 day")
    d=$(date -I -d "$d + 1 day")
    echo "$d2"
    echo "$d"
    find -type f -newerct "${d2}" ! -newerct "${d}" -printf "%T@ %Tc %p\n" | sort -n | tail -n +2 | awk '{print $9}' | xargs rm
done
I'd suggest adding paths and commenting out the xargs rm bit at first (just to print and double-check what you're removing).
There's probably a more elegant way to do this other than the print stuff, but it works.
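As a side note, the question asks about modification time while -newerct compares change time. A dry-run sketch of the inner command using -newermt and an epoch-only timestamp (so the path is everything after the first space, immune to locale-dependent field counts); the folder path is a placeholder:
# dry run: print candidates instead of removing them (assumes GNU find)
find /some_folder/ -type f -newermt "${d2}" ! -newermt "${d}" \
    -printf "%T@ %p\n" | sort -n | tail -n +2 | cut -d' ' -f2-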

For common filenames
find /some_folder/ -name "file_prefix*" -mtime +30 -printf '%TD %TT %p\n' |
sort |
awk '{if ($1==prevdate) print $3; prevdate=$1}' |
xargs rm
The find command will print %TD %TT %p, i.e. the last modification date followed by the last modification time and then the filepath (folder and filename).
The list is sorted by sort. Because of the date/time/filepath structure, this will sort by date then time so that the oldest files are printed first, which is important afterward.
awk parses each line and calls {if ($1==prevdate) print $3; prevdate=$1}. Because of the date/time/filepath structure, the date is $1, the time is $2 and the filepath is $3. This prints the filepath whenever the date is the same as the previously parsed date. So the first file of the day is not printed out, because its date differs from that of the previous line, while all subsequent files of the same day are printed. Please note that prevdate is initially unassigned, which is roughly equivalent to the null string. You can write this if you find it more readable:
awk 'BEGIN{prevdate=""} {if ($1==prevdate) print $3; prevdate=$1}'
Finally, xargs rm will call rm for each line from the standard input which, at this moment, contains the list of files printed by awk.
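A slightly more defensive variant, assuming GNU xargs: -r skips the rm call entirely when nothing matches, and -- protects against filenames that begin with a dash:
find /some_folder/ -name "file_prefix*" -mtime +30 -printf '%TD %TT %p\n' |
sort |
awk '{if ($1==prevdate) print $3; prevdate=$1}' |
xargs -r rm --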
Handling spaces
If your filenames contain space characters, the solution can be slightly adjusted:
find /some_folder/ -name "file_prefix*" -mtime +30 -printf '%TD %TT %p\n' |
sort |
awk '{if ($1==prevdate) print; prevdate=$1}' |
cut -d ' ' -f3- |
xargs rm
awk prints the whole line instead of printing the filepath only, then the filepath is extracted with cut -d ' ' -f3- before calling xargs rm.
Handling weird filenames
The above solutions do not work with filenames containing newlines and possibly won’t work with backslashes either.
I assume you won’t run into these issues because if you want to clean up the directory, chances are you already know what’s inside this directory, and these are probably log files or other type of files created automatically.
However, should you need to handle all types of filenames, the command below will do the trick:
unset prevdate currentdate filepath
find /some_folder/ -name "file_prefix*" -mtime +30 -printf '%TD %TT %p\0' |
sort -z |
while IFS= read -r -d '' line
do
    currentdate=${line%% *}
    if [ "$currentdate" = "$prevdate" ]
    then
        filepath=$(cut -d ' ' -f3- <<< "$line")
        rm -- "$filepath"
    fi
    prevdate=$currentdate
done
It behaves essentially like the initial solution but strings are separated by the null character (which is a forbidden character in a filename) instead of the traditional newline separation.
find outputs results with %TD %TT %p\0 instead of %TD %TT %p\n which means results are null-separated, then sort -z makes use of this null-separated result, and finally the while loop is a rewrite of the awk script, but makes use of null-separated strings (which is almost impossible to do with awk). There is no call to xargs rm because rm is called directly inside the while loop.
While the ability to handle all types of filenames is tempting, please note that this solution is significantly less efficient than the other solutions. The code above is written for clarity rather than speed, but it would still be slower even if optimized.
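For what it's worth, a sketch of one such optimization: inside the loop body, replace the cut command substitution with two parameter expansions, saving a fork per file:
# instead of: filepath=$(cut -d ' ' -f3- <<< "$line")
filepath=${line#* }       # drop the leading date field
filepath=${filepath#* }   # drop the time field, keeping the full path intact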
Same date and time
If several "first files of the day" occur at the exact same time within the same day, only the one with the lexicographically "lowest" file path will be kept. If you want to keep all first files of the day that share the exact same time, this is slightly more complicated, but doable.
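A minimal sketch of that, staying with the awk approach from above: remember the first timestamp seen for each day and only print (hence delete) files whose time differs from it:
find /some_folder/ -name "file_prefix*" -mtime +30 -printf '%TD %TT %p\n' |
sort |
awk '$1 != prevdate { prevdate = $1; firsttime = $2; next }
     $2 != firsttime { print $3 }' |
xargs -r rm --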

Related

How to use GNU find command to find files by pattern and list files in order of most recent modification to least?

I want to use the GNU find command to find files based on a pattern, and then have them displayed in order of the most recently modified file to the least recently modified.
I understand this:
find / -type f -name '*.md'
but then what would be added to sort the files from the most recently modified to the least?
find can't sort files, so you can instead output the modification time plus filename, sort on modification time, then remove the modification time again:
find . -type f -name '*.md' -printf '%T@ %p\0' | # Print time+name
sort -rnz | # Sort numerically, descending
cut -z -d ' ' -f 2- | # Remove time
tr '\0' '\n' # Optional: make human readable
This uses \0-separated entries to avoid problems with any kind of filenames. You can pass this directly and safely to a number of tools, but here it instead pipes to tr to show the file list as individual lines.
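For instance, a sketch of reading that NUL-separated list into a bash array (readarray -d needs bash 4.4+):
readarray -t -d '' newest_first < <(
    find . -type f -name '*.md' -printf '%T@ %p\0' | sort -rnz | cut -z -d ' ' -f 2-
)
printf '%s\n' "${newest_first[0]}"   # the most recently modified file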
find <dir> -name "*.md" -printf "%Ts - %h/%f\n" | sort -rn
Print the modified time in epoch format (%Ts) as well as the directories (%h) and file name (%f). Pipe this through to sort -rn to sort in reversed number order.
Pipe the output of find to xargs and ls:
find / -type f -name '*.md' | xargs ls -1t
Note that this breaks on filenames containing spaces, and the sorting is only per ls invocation: with more files than fit in one xargs batch, ls runs several times and the overall order is no longer guaranteed.

Linux piping find and md5sum not sending output

Trying to loop every file, do some cutting, extract the first 4 characters of the MD5.
Here's what I got so far:
find . -name *.jpg | cut -f4 -d/ | cut -f1 -d. | md5sum | head -c 4
Problem is, I get only a single hash at this point. How can I send each name to md5sum and keep the results flowing through?
md5sum reads everything from stdin until end of file (EOF) and outputs the md5 sum of the full input. You should split the input into lines and run md5sum per line, for example with a while read loop:
find . -name '*.jpg' | cut -f4 -d/ | cut -f1 -d. |
while read -r a; do
    echo -n "$a" | md5sum | head -c 4
done
The read shell builtin reads one line from input into the variable $a; the while loop runs the loop body (the commands between do and done) for every successful read, with $a holding the current line. The -r option of read prevents backslash interpretation; the -n option of echo suppresses the trailing newline (if you want a newline, remove -n).
This will be slow for thousands of files or more, as there are several forks/execs for every file inside the loop. Some scripting with perl or python or nodejs or any other scripting language with built-in md5 hash computing (or a suitable library) will be faster.
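Short of switching languages, a small sketch that shaves one fork per file by trimming the hash with parameter expansion instead of head (it also prints one result per line, unlike the loop above):
find . -name '*.jpg' | cut -f4 -d/ | cut -f1 -d. |
while IFS= read -r name; do
    sum=$(printf '%s' "$name" | md5sum)  # hash the bare name, no trailing newline
    printf '%s\n' "${sum:0:4}"           # first 4 hex characters of the digest
done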
You can do what you are attempting to do with a short "helper" script that you call from find. For example, you could create a short script that takes the basename of each file passed as an argument, removes the '.jpg' extension, and then feeds the remaining name to md5sum on stdin to get the md5sum of the name itself. Call the script anything you like, say namemd5.sh. Example:
#!/bin/bash
[ -z "$1" ] && exit 1 ## validate single argument
fname=$(basename "$1") ## get the filename alone
fname="${fname%.jpg}" ## remove .jpg extension
fnsum=$(md5sum - <<<"$fname") ## get md5sum of name w/o .jpg
fnsum=${fnsum%% *} ## remove trailing ' -'
echo "$fnsum - $fname" ## output md5sum - name
## (remove ' - $fname' for md5sum alone)
(note: the name is provided as part of the output for example purposes, remove if you want the md5sum alone as shown in the comment above)
Example Files
$ find /home/david/img/wp/ -type f -name "*.jpg"
/home/david/img/wp/hacker_manifesto_1200x900.jpg
/home/david/img/wp/hacker_manifesto_by_otalicus.jpg
/home/david/img/wp/reflections-triple-1920x1200.jpg
/home/david/img/wp/hacker_wallpaper_1600x900.jpg
/home/david/img/wp/Zen.jpg
/home/david/img/wp/hacker_wallpaper_by_vanilla23-dot254.jpg
/home/david/img/wp/hacker_manifesto_1600x900.jpg
Example Use/Output
$ find /home/david/img/wp/ -type f -name "*.jpg" -exec ./namemd5.sh '{}' \;
0f7d2aac158eb9f7842215e14ff6573c - hacker_manifesto_1200x900
604bc695a0bb70b8db0352267caf226f - hacker_manifesto_by_otalicus
5decea0e306f185bf988ac9934ec0e2c - reflections-triple-1920x1200
82bd8e1ad3df588eb0e0848c5f764812 - hacker_wallpaper_1600x900
0f4daba431a22c03f28977f087e4c695 - Zen
0c55cd3ebd2a847e10c20d86e80e6ceb - hacker_wallpaper_by_vanilla23-dot254
e5c1da0c2db3827d2bf81c306633cc56 - hacker_manifesto_1600x900
You can also call the script with the -execdir version within find as well, e.g.
$ find /home/david/img/wp/ -type f -name "*.jpg" -execdir \
/full/path/to/namemd5.sh '{}' \;
(note: the use of the /full/path to your helper script above)
To find all .jpg files, run md5sum on each, and cut the first 4 characters:
find . -name '*.jpg' -exec md5sum {} \; | cut -b 1-4
Note this hashes each file's contents, not the filename as in the question.

How do I classify files on a Linux server by their names?

How can I use the ls command and its options to list the duplicated filenames that are in different directories?
You can't use a single, basic ls command to do this. You'd have to use a combination of other POSIX/Unix/GNU utilities. For example, to find the duplicate filenames first:
find . -type f -exec basename {} \; | sort | uniq -d > dupes
This means: find all the files (-type f) through the entire directory hierarchy under the current directory (.), and execute (-exec) the command basename (which strips the directory portion) on the found file ({}), end of command (\;). These names are then sorted, and duplicate lines are printed out (uniq -d). The result goes in the file dupes. Now you have the filenames that are duplicated, but you don't know what directories they are in. Use find again to find them. Using bash as your shell:
while read filename; do find . -name "$filename" -print; done < dupes
This means loop through (while) all contents of file dupes and read into the variable filename each line. For each line, execute find again and search for the specific -name of the $filename and print it out (-print, but it's implicit so this is redundant).
Truth be told you can combine these without using an intermediate file:
find . -type f -exec basename {} \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
If you're not familiar with it, the | operator means, execute the following command using the output of the previous command as the input to the following command. Example:
eje@EEWANCO-PC:~$ mkdir test
eje@EEWANCO-PC:~$ cd test
eje@EEWANCO-PC:~/test$ mkdir 1 2 3 4 5
eje@EEWANCO-PC:~/test$ mkdir 1/2 2/3
eje@EEWANCO-PC:~/test$ touch 1/0000 2/1111 3/2222 4/2222 5/0000 1/2/1111 2/3/4444
eje@EEWANCO-PC:~/test$ find . -type f -exec basename {} \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
./1/0000
./5/0000
./1/2/1111
./2/1111
./3/2222
./4/2222
Disclaimer: The requirement stated that the filenames were all numbers. While I have tried to design the code to handle filenames with spaces (and in tests on my system, it works), the code may break when it encounters special characters, newlines, nuls, or other unusual situations. Please note that the -exec parameter has special security considerations and should not be used by root over arbitrary user files. The simplified example provided is intended for illustrative and didactic purposes only. Please consult your man pages and relevant CERT advisories for full security implications.
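As an aside, GNU find can emit basenames directly via -printf '%f\n', avoiding one basename fork per file; a minimal sketch of the combined command with that change:
find . -type f -printf '%f\n' | sort | uniq -d |
while read -r filename; do find . -name "$filename" -print; done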
I have a function in my bash profile (bash 4.4) for duplicate files.
It is true that find is the correct tool.
I use find combined with the -print0 option, which separates the find results with the null char instead of newlines (find's default). Now I can catch all files under the current directory and subdirectories.
This ensures that results will be correct no matter whether filenames contain special chars like spaces or newlines (in some very rare cases). Instead of running find against find twice, you can build an array and just locate the duplicate files in that array. Then you grep the whole array using the "duplicates" as the pattern.
So something like this works OK for my function:
$ IFS= readarray -t -d '' fn < <(find . -name 'file*' -print0)
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[@]##*/}") |uniq -d)
$ grep -e "$dupes" <(printf '%s\n' "${fn[@]}") |awk -F/ '{print $NF,"==>",$0}' |LC_ALL=C sort
This is a test:
$ IFS= readarray -t -d '' fn < <(find . -name 'file*' -print0)
# find all files and load them in an array using null delimiter
$ printf '%s\n' "${fn[@]}" # print the array
./tmp/file7
./tmp/file14
./tmp/file11
./tmp/file8
./tmp/file9
./tmp/tmp2/file09 99
./tmp/tmp2/file14.txt
./tmp/tmp2/file15.txt
./tmp/tmp2/file$100
./tmp/tmp2/file14.txt.bak
./tmp/tmp2/file15.txt.bak
./tmp/file1
./tmp/file4
./file09 99
./file14
./file$100
./file1
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[@]##*/}") |uniq -d)
#Locate duplicate files
$ echo "$dupes"
\<file$100\>$ #Mind this one with special char $ in filename
\<file09 99\>$ #Mind also this one with spaces
\<file14\>$
\<file1\>$
# The results are deliberately enclosed between \<...\> to force grep to capture full words later and avoid file1 matching file1.txt or file11
$ grep -e "$dupes" <(printf '%s\n' "${fn[@]}") |awk -F/ '{print $NF,"==>",$0}' |LC_ALL=C sort
file$100 ==> ./file$100 #File with special char correctly captured
file$100 ==> ./tmp/tmp2/file$100
file09 99 ==> ./file09 99 #File with spaces in name also correctly captured
file09 99 ==> ./tmp/tmp2/file09 99
file1 ==> ./file1
file1 ==> ./tmp/file1
file14 ==> ./file14 #other files named file14 like file14.txt and file14.txt.bak not captured since they are not duplicates.
file14 ==> ./tmp/file14
Tips:
This one, <(printf '\<%s\>$\n' "${fn[@]##*/}"), uses process substitution on the basenames of the find results using bash's built-in parameter expansion.
LC_ALL=C is required when sorting in order for filenames to be sorted correctly.
In bash versions before 4.4, readarray does not accept the -d option (delimiter). In that case you can transform the find results into an array with
while IFS= read -r -d '' res; do fn+=( "$res" ); done < <(find ... -print0)

Adding file size and other info to a find

I am currently using the following linux command:
find /folder -size +1000k | grep txt
to find in the "folder" any files with "txt" in the name that have a size greater than 1000k bytes.
This successfully returns the list of files that I want. But I would also like to print out the file size and, if I can, the last modified date within the returned list of files (much like what the command ll returns).
I tried using printf %s, but this simply returns a list of numbers, thus the grep doesn't work.
First -- there's absolutely no reason to use grep on output from find; you can just tell find to do the filtering itself.
Second -- the -printf action takes a format string which can have more than one specifier. For instance, %s %P\n, to print first size, then name, then a newline. (This ordering is desirable since the size will always be a single string of digits, whereas a name can contain almost anything -- putting the name first would make the output harder to parse.)
find /folder -size +1000k -name '*txt*' -printf '%s %P\n'
Mind you -- for completely unambiguous parsing, you'll want to use NUL separators rather than newlines, since newlines are valid inside filenames.
An example which reads filenames and sizes into a pair of bash arrays, after sorting by size:
files=( )
sizes=( )
while IFS= read -r -d ' ' size && IFS= read -r -d '' filename; do
    files+=( "$filename" )
    sizes+=( "$size" )
done < <(find /folder -size +1000k -name '*txt*' -printf '%s %P\0' | sort -n -z)
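And since the question also asked for the last-modified date, a hedged sketch of an ll-like listing, assuming GNU find's strftime-style %T specifiers:
find /folder -size +1000k -name '*txt*' \
    -printf '%s\t%TY-%Tm-%Td %TH:%TM\t%P\n' | sort -n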

Copy the three newest files under one directory (recursively) to another specified directory

I'm using bash.
Suppose I have a log file directory /var/myprogram/logs/.
Under this directory I have many sub-directories and sub-sub-directories that include different types of log files from my program.
I'd like to find the three newest files (modified most recently), whose name starts with 2010, under /var/myprogram/logs/, regardless of sub-directory and copy them to my home directory.
Here's what I would do manually
1. Go through each directory and do ls -lt 2010*
to see which files starting with 2010 are modified most recently.
2. Once I go through all directories, I'd know which three files are the newest. So I copy them manually to my home directory.
This is pretty tedious, so I wondered if maybe I could somehow pipe some commands together to do this in one step, preferably without using shell scripts?
I've been looking into find, ls, head, and awk that I might be able to use but haven't figured the right way to glue them together.
Let me know if I need to clarify. Thanks.
Here's how you can do it:
find -type f -name '2010*' -printf "%C@\t%P\n" |sort -r -k1,1 |head -3 |cut -f 2-
This outputs a list of files prefixed by their last change time, sorts them based on that value, takes the top 3 and removes the timestamp. (Note that %C@ is the status-change time; use %T@ if you want the modification time, which is what the question asks for.)
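Put together for the actual task, a sketch using modification time and copying the three newest files to the home directory; it assumes the log file names contain no tabs or newlines:
find /var/myprogram/logs/ -type f -name '2010*' -printf '%T@\t%P\n' |
sort -rn | head -3 | cut -f 2- |
while IFS= read -r f; do cp "/var/myprogram/logs/$f" ~/; done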
Your answers feel very complicated, how about
for FILE in `find . -type d`; do ls -t -1 -F "$FILE" | grep -v "/" | head -n3 | xargs -I{} mv {} ..; done;
or laid out nicely
for FILE in `find . -type d`;
do
    ls -t -1 -F "$FILE" | grep -v "/" | grep "^2010" | head -n3 | xargs -I{} mv {} ~;
done;
My "shortest" answer after quickly hacking it up.
for file in $(find . -iname '*.php' -mtime 1 | xargs ls -l | awk '{ print $6" "$7" "$8" "$9 }' | sort | sed -n '1,3p' | awk '{ print $4 }'); do cp "$file" ../; done
The main command stored in $() does the following:
Find all files recursively in the current directory matching (case insensitively) the name *.php and modified roughly a day ago (note that -mtime 1 matches files modified between 24 and 48 hours ago; -mtime -1 would match the last 24 hours).
Pipe to ls -l, required to be able to sort by modification date, so we can have the first three
Extract the modification date and file name/path with awk
Sort these files based on datetime
With sed print only the first 3 files
With awk print only their name/path
Used in a for loop and as action copy them to the desired location.
Or use @Hasturkun's variant, which popped up as a response while I was editing this post :)
