Delete the first 10 largest regular files using shell script - linux

I'm trying to delete the first largest regular files from the given directory, but it doesn't work for files which contain whitespace caracters.
My code (it works if the files doesn't contain whitespace caracters):
find mydir -type f -exec du -ahb {} + | sort -n -r | cut -f2 | head -n 10 | xargs rm -i
I also tried this, but it gives an error message:
find mydir -type f -exec du -ahb {} + -print 0 | sort -n -r | cut -f2 | head -n 10 | xargs -0 rm -i

The following should work at least with GNU coreutils 8.25 and newer :
find mydir -type f -exec du -0b {} + | sort -znr | cut -zf2 | head -zn 10 | xargs -0pn 1 rm
I made sure every command handled and outputted NUL bytes (\0) separated records rather than linefeed separated records :
du outputs NUL-separated records with -0
sort, cut and head handle and output NUL-separated records with -z
xargs handles NUL-separated records with -0
Additionally, I removed the interactive mode of rm and asked xargs to handle that instead (-p), because xargs didn't provide a prompt to rm when invoking it. I had to limit the number of parameters given at once to rm to 1 for this to work (xargs' -n 1 parameter). There might be a way to preserve the -i and provide rm with an interface to your prompt, but I don't know how.
Last point : I removed du's -human-readable mode because it would have made the sort often fail and it didn't serve any purpose since the filesizes were never displayed to an human.

Related

Shell script to delete all folders but the last 5 ones

I'm looking for a variant on the following script to delete all folders except the last 5 created:
find ./ -type d -ctime +10 -exec rm -rf {} +
So this will delete everything older than 10 days.
But the time factor in my case does not always apply. I need a similar script to delete folders, but I always want to keep the last 5 created folders (by date).
So when there are 100 folders, it needs to delete 95 of them and keep the last 5 created.
When there are 5, it needs to keep them all.
When there are 6, it needs to delete only the first created and keep the other 5.
First find the files printed with a leading date in column #1, sort by date, omit the last 5 newest items from the list, (the ones to keep), remove column #1, and then rm whatever older directories are left. Test code with echo first:
find . -type d -printf '%T# %p\n' | sort -n |
head -n -5 | cut -f2- -d" " | xargs -0 echo rm -rd
...and only if it looks OK, remove the echo to do the deed:
find . -type d -printf '%T# %p\n' | sort -n |
head -n -5 | cut -f2- -d" " | xargs -0 rm -rd
Some of the above code stolen from plundra's answer to "How to recursively find the latest modified file in a directory?"
Pretty much untested, hence the 'ls -ld' at the end! :)
find ./ -type d -printf '%T# %p\n' | sort -nr | cut -d' ' -f2- | grep "^....*" | tail --lines=+5 | xargs -i ls -ld {}

Using pipes with find command in linux

I would like to find files in my home directory that start with '~', sort them numerically, print the first five and delete them using find command and pipes in Linux. I have a bash script:
#!/bin/bash
find ~/ -name "~*" | sort -n | head -5 | tee | xargs rm
This works fine for deleting files, but I was expecting tee command to print deleted files to standard output. All this command does is delete files, but there in so output in terminal. What should I add/ change?
Thank you.
You could just use the verbose flag on rm and it will tell you what it's deleting
find ~/ -name "~*" | sort -n | head -5 | xargs rm -v
Use man rm to see the docs
-v, --verbose
explain what is being done
You can use rm -v to print each deleting filename:
find ~ -name '~*' -print0 | sort -zn | head -z -n 5 | xargs -0 rm -v
Also note use -print0 and all corresponding options in sort. head, xargs to address filenames with whitespace and glob characters.

How to delete X number of files in a directory

To get X number of files in a directory, I can do:
$ ls -U | head -40000
How would I then delete these 40,000 files? For example, something like:
$ "rm -rf" (ls -U | head -40000)
The tool you need for this is xargs. It will convert standard input into arguments to a command that you specify. Each line of the input is treated as a single argument.
Thus, something like this would work (see the comment below, though, ls shouldn't be parsed this way normally):
ls -U | head -40000 | xargs rm -rf
I would recommend before trying this to start with a small head size and use xargs echo to print out the filenames being passed so you understand what you'll be deleting.
Be aware if you have files with weird characters that this can sometimes be a problem. If you are on a modern GNU system you may also wish to use the arguments to these commands that use null characters to separate each element. Since a filename cannot contain a null character that will safely parse all possible names. I am not aware of a simple way to take the top X items when they are zero separated.
So, for example you can use this to delete all files in a directory
find . -maxdepth 1 -print0 | xargs -0 rm -rf
Use a bash array and slice it. If the number and size of arguments is likely to get close to the system's limits, you can still use xargs to split up the remainder.
files=( * )
printf '%s\0' "${files[#]:0:40000}" | xargs -0 rm
What about using awk as the filter?
find "$FOLDER" -maxdepth 1 -mindepth 1 -print0 \
| awk -v limit=40000 'NR<=limit;NR>limit{exit}' RS="\0" ORS="\0" \
| xargs -0 rm -rf
It will reliably remove at most 40.000 files (or folders). Reliably means regardless of which characters the filenames may contain.
Btw, to get the number of files in a directory reliably you can do:
find FOLDER -mindepth 1 -maxdepth 1 -printf '.' | wc -c
I ended up doing this since my folders were named with sequential numbers. This should also work for alphabetical folders:
ls -r releases/ | sed '1,3d' | xargs -I {} rm -rf releases/{}
Details:
list all the items in the releases/ folder in reverse order
slice off the first 3 items (which would be the newest if numeric/alpha naming)
for each item, rm it
In your case, you can replace ls -r with ls -U and 1,3d with 1,40000d. That should be the same, I believe.

Linux command: How to 'find' only text files?

After a few searches from Google, what I come up with is:
find my_folder -type f -exec grep -l "needle text" {} \; -exec file {} \; | grep text
which is very unhandy and outputs unneeded texts such as mime type information. Any better solutions? I have lots of images and other binary files in the same folder with a lot of text files that I need to search through.
I know this is an old thread, but I stumbled across it and thought I'd share my method which I have found to be a very fast way to use find to find only non-binary files:
find . -type f -exec grep -Iq . {} \; -print
The -I option to grep tells it to immediately ignore binary files and the . option along with the -q will make it immediately match text files so it goes very fast. You can change the -print to a -print0 for piping into an xargs -0 or something if you are concerned about spaces (thanks for the tip, #lucas.werkmeister!)
Also the first dot is only necessary for certain BSD versions of find such as on OS X, but it doesn't hurt anything just having it there all the time if you want to put this in an alias or something.
EDIT: As #ruslan correctly pointed out, the -and can be omitted since it is implied.
Based on this SO question :
grep -rIl "needle text" my_folder
Why is it unhandy? If you need to use it often, and don't want to type it every time just define a bash function for it:
function findTextInAsciiFiles {
# usage: findTextInAsciiFiles DIRECTORY NEEDLE_TEXT
find "$1" -type f -exec grep -l "$2" {} \; -exec file {} \; | grep text
}
put it in your .bashrc and then just run:
findTextInAsciiFiles your_folder "needle text"
whenever you want.
EDIT to reflect OP's edit:
if you want to cut out mime informations you could just add a further stage to the pipeline that filters out mime informations. This should do the trick, by taking only what comes before :: cut -d':' -f1:
function findTextInAsciiFiles {
# usage: findTextInAsciiFiles DIRECTORY NEEDLE_TEXT
find "$1" -type f -exec grep -l "$2" {} \; -exec file {} \; | grep text | cut -d ':' -f1
}
find . -type f -print0 | xargs -0 file | grep -P text | cut -d: -f1 | xargs grep -Pil "search"
This is unfortunately not space save. Putting this into bash script makes it a bit easier.
This is space safe:
#!/bin/bash
#if [ ! "$1" ] ; then
echo "Usage: $0 <search>";
exit
fi
find . -type f -print0 \
| xargs -0 file \
| grep -P text \
| cut -d: -f1 \
| xargs -i% grep -Pil "$1" "%"
Another way of doing this:
# find . |xargs file {} \; |grep "ASCII text"
If you want empty files too:
# find . |xargs file {} \; |egrep "ASCII text|empty"
How about this:
$ grep -rl "needle text" my_folder | tr '\n' '\0' | xargs -r -0 file | grep -e ':[^:]*text[^:]*$' | grep -v -e 'executable'
If you want the filenames without the file types, just add a final sed filter.
$ grep -rl "needle text" my_folder | tr '\n' '\0' | xargs -r -0 file | grep -e ':[^:]*text[^:]*$' | grep -v -e 'executable' | sed 's|:[^:]*$||'
You can filter-out unneeded file types by adding more -e 'type' options to the last grep command.
EDIT:
If your xargs version supports the -d option, the commands above become simpler:
$ grep -rl "needle text" my_folder | xargs -d '\n' -r file | grep -e ':[^:]*text[^:]*$' | grep -v -e 'executable' | sed 's|:[^:]*$||'
Here's how I've done it ...
1 . make a small script to test if a file is plain text
istext:
#!/bin/bash
[[ "$(file -bi $1)" == *"file"* ]]
2 . use find as before
find . -type f -exec istext {} \; -exec grep -nHi mystring {} \;
Here's a simplified version with extended explanation for beginners like me who are trying to learn how to put more than one command in one line.
If you were to write out the problem in steps, it would look like this:
// For every file in this directory
// Check the filetype
// If it's an ASCII file, then print out the filename
To achieve this, we can use three UNIX commands: find, file, and grep.
find will check every file in the directory.
file will give us the filetype. In our case, we're looking for a return of 'ASCII text'
grep will look for the keyword 'ASCII' in the output from file
So how can we string these together in a single line? There are multiple ways to do it, but I find that doing it in order of our pseudo-code makes the most sense (especially to a beginner like me).
find ./ -exec file {} ";" | grep 'ASCII'
Looks complicated, but not bad when we break it down:
find ./ = look through every file in this directory. The find command prints out the filename of any file that matches the 'expression', or whatever comes after the path, which in our case is the current directory or ./
The most important thing to understand is that everything after that first bit is going to be evaluated as either True or False. If True, the file name will get printed out. If not, then the command moves on.
-exec = this flag is an option within the find command that allows us to use the result of some other command as the search expression. It's like calling a function within a function.
file {} = the command being called inside of find. The file command returns a string that tells you the filetype of a file. Regularly, it would look like this: file mytextfile.txt. In our case, we want it to use whatever file is being looked at by the find command, so we put in the curly braces {} to act as an empty variable, or parameter. In other words, we're just asking for the system to output a string for every file in the directory.
";" = this is required by find and is the punctuation mark at the end of our -exec command. See the manual for 'find' for more explanation if you need it by running man find.
| grep 'ASCII' = | is a pipe. Pipe take the output of whatever is on the left and uses it as input to whatever is on the right. It takes the output of the find command (a string that is the filetype of a single file) and tests it to see if it contains the string 'ASCII'. If it does, it returns true.
NOW, the expression to the right of find ./ will return true when the grep command returns true. Voila.
I have two issues with histumness' answer:
It only list text files. It does not actually search them as
requested. To actually search, use
find . -type f -exec grep -Iq . {} \; -and -print0 | xargs -0 grep "needle text"
It spawns a grep process for every file, which is very slow. A better solution is then
find . -type f -print0 | xargs -0 grep -IZl . | xargs -0 grep "needle text"
or simply
find . -type f -print0 | xargs -0 grep -I "needle text"
This only takes 0.2s compared to 4s for solution above (2.5GB data / 7700 files), i.e. 20x faster.
Also, nobody cited ag, the Silver Searcher or ack-grep¸as alternatives. If one of these are available, they are much better alternatives:
ag -t "needle text" # Much faster than ack
ack -t "needle text" # or ack-grep
As a last note, beware of false positives (binary files taken as text files). I already had false positive using either grep/ag/ack, so better list the matched files first before editing the files.
Although it is an old question, I think this info bellow will add to the quality of the answers here.
When ignoring files with the executable bit set, I just use this command:
find . ! -perm -111
To keep it from recursively enter into other directories:
find . -maxdepth 1 ! -perm -111
No need for pipes to mix lots of commands, just the powerful plain find command.
Disclaimer: it is not exactly what OP asked, because it doesn't check if the file is binary or not. It will, for example, filter out bash script files, that are text themselves but have the executable bit set.
That said, I hope this is useful to anyone.
I do it this way:
1) since there're too many files (~30k) to search thru, I generate the text file list daily for use via crontab using below command:
find /to/src/folder -type f -exec file {} \; | grep text | cut -d: -f1 > ~/.src_list &
2) create a function in .bashrc:
findex() {
cat ~/.src_list | xargs grep "$*" 2>/dev/null
}
Then I can use below command to do the search:
findex "needle text"
HTH:)
I prefer xargs
find . -type f | xargs grep -I "needle text"
if your filenames are weird look up using the -0 options:
find . -type f -print0 | xargs -0 grep -I "needle text"
bash example to serach text "eth0" in /etc in all text/ascii files
grep eth0 $(find /etc/ -type f -exec file {} \; | egrep -i "text|ascii" | cut -d ':' -f1)
If you are interested in finding any file type by their magic bytes using the awesome file utility combined with power of find, this can come in handy:
$ # Let's make some test files
$ mkdir ASCII-finder
$ cd ASCII-finder
$ dd if=/dev/urandom of=binary.file bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.009023 s, 116 MB/s
$ file binary.file
binary.file: data
$ echo 123 > text.txt
$ # Let the magic begin
$ find -type f -print0 | \
xargs -0 -I ## bash -c 'file "$#" | grep ASCII &>/dev/null && echo "file is ASCII: $#"' -- ##
Output:
file is ASCII: ./text.txt
Legend: $ is the interactive shell prompt where we enter our commands
You can modify the part after && to call some other script or do some other stuff inline as well, i.e. if that file contains given string, cat the entire file or look for a secondary string in it.
Explanation:
find items that are files
Make xargs feed each item as a line into one liner bash
command/script
file checks type of file by magic byte, grep checks if ASCII
exists, if so, then after && your next command executes.
find prints results null separated, this is good to escape
filenames with spaces and meta-characters in it.
xargs , using -0 option, reads them null separated, -I ##
takes each record and uses as positional parameter/args to bash
script.
-- for bash ensures whatever comes after it is an argument even
if it starts with - like -c which could otherwise be interpreted
as bash option
If you need to find types other than ASCII, simply replace grep ASCII with other type, like grep "PDF document, version 1.4"
find . -type f | xargs file | grep "ASCII text" | awk -F: '{print $1}'
Use find command to list all files, use file command to verify they are text (not tar,key), finally use awk command to filter and print the result.
How about this
find . -type f|xargs grep "needle text"

Shell script to count files, then remove oldest files

I am new to shell scripting, so I need some help here. I have a directory that fills up with backups. If I have more than 10 backup files, I would like to remove the oldest files, so that the 10 newest backup files are the only ones that are left.
So far, I know how to count the files, which seems easy enough, but how do I then remove the oldest files, if the count is over 10?
if [ls /backups | wc -l > 10]
then
echo "More than 10"
fi
Try this:
ls -t | sed -e '1,10d' | xargs -d '\n' rm
This should handle all characters (except newlines) in a file name.
What's going on here?
ls -t lists all files in the current directory in decreasing order of modification time. Ie, the most recently modified files are first, one file name per line.
sed -e '1,10d' deletes the first 10 lines, ie, the 10 newest files. I use this instead of tail because I can never remember whether I need tail -n +10 or tail -n +11.
xargs -d '\n' rm collects each input line (without the terminating newline) and passes each line as an argument to rm.
As with anything of this sort, please experiment in a safe place.
find is the common tool for this kind of task :
find ./my_dir -mtime +10 -type f -delete
EXPLANATIONS
./my_dir your directory (replace with your own)
-mtime +10 older than 10 days
-type f only files
-delete no surprise. Remove it to test your find filter before executing the whole command
And take care that ./my_dir exists to avoid bad surprises !
Make sure your pwd is the correct directory to delete the files then(assuming only regular characters in the filename):
ls -A1t | tail -n +11 | xargs rm
keeps the newest 10 files. I use this with camera program 'motion' to keep the most recent frame grab files. Thanks to all proceeding answers because you showed me how to do it.
The proper way to do this type of thing is with logrotate.
I like the answers from #Dennis Williamson and #Dale Hagglund. (+1 to each)
Here's another way to do it using find (with the -newer test) that is similar to what you started with.
This was done in bash on cygwin...
if [[ $(ls /backups | wc -l) > 10 ]]
then
find /backups ! -newer $(ls -t | sed '11!d') -exec rm {} \;
fi
Straightforward file counter:
max=12
n=0
ls -1t *.dat |
while read file; do
n=$((n+1))
if [[ $n -gt $max ]]; then
rm -f "$file"
fi
done
I just found this topic and the solution from mikecolley helped me in a first step. As I needed a solution for a single line homematic (raspberrymatic) script, I ran into a problem that this command only gave me the fileames and not the whole path which is needed for "rm". My used CUxD Exec command can not start in a selected folder.
So here is my solution:
ls -A1t $(find /media/usb0/backup/ -type f -name homematic-raspi*.sbk) | tail -n +11 | xargs rm
Explaining:
find /media/usb0/backup/ -type f -name homematic-raspi*.sbk searching only files -type f whiche are named like -name homematic-raspi*.sbk (case sensitive) or use -iname (case insensitive) in folder /media/usb0/backup/
ls -A1t $(...) list the files given by find without files starting with "." or ".." -A sorted by mtime -t and with a return of only one column -1
tail -n +11 return of only the last 10 -n +11 lines for following rm
xargs rm and finally remove the raiming files in the list
Maybe this helps others from longer searching and makes the solution more flexible.
stat -c "%Y %n" * | sort -rn | head -n +10 | \
cut -d ' ' -f 1 --complement | xargs -d '\n' rm
Breakdown: Get last-modified times for each file (in the format "time filename"), sort them from oldest to newest, keep all but the last ten entries, and then keep all but the first field (keep only the filename portion).
Edit: Using cut instead of awk since the latter is not always available
Edit 2: Now handles filenames with spaces
On a very limited chroot environment, we had only a couple of programs available to achieve what was initially asked. We solved it that way:
MIN_FILES=5
FILE_COUNT=$(ls -l | grep -c ^d )
if [ $MIN_FILES -lt $FILE_COUNT ]; then
while [ $MIN_FILES -lt $FILE_COUNT ]; do
FILE_COUNT=$[$FILE_COUNT-1]
FILE_TO_DEL=$(ls -t | tail -n1)
# be careful with this one
rm -rf "$FILE_TO_DEL"
done
fi
Explanation:
FILE_COUNT=$(ls -l | grep -c ^d ) counts all files in the current folder. Instead of grep we could use also wc -l but wc was not installed on that host.
FILE_COUNT=$[$FILE_COUNT-1] update the current $FILE_COUNT
FILE_TO_DEL=$(ls -t | tail -n1) Save the oldest file name in the $FILE_TO_DEL variable. tail -n1 returns the last element in the list.
Based on others suggestions and some awk foo, I got this to work. I know this an old thread, but I didn't find a decent answer here and this sorted it for me. This just deletes the oldest file, but you can change the head -n 1 to 10 and get the oldest 10.
find $DIR -type f -printf '%T+ %p\n' | sort | head -n 1 | awk '{first =$1; $1 =""; print $0}' | xargs -d '\n' rm
Using inode numbers via stat & find command (to avoid pesky-chars-in-file-name issues):
stat -f "%m %i" * | sort -rn -k 1,1 | tail -n +11 | cut -d " " -f 2 | \
xargs -n 1 -I '{}' find "$(pwd)" -type f -inum '{}' -print
#stat -f "%m %i" * | sort -rn -k 1,1 | tail -n +11 | cut -d " " -f 2 | \
# xargs -n 1 -I '{}' find "$(pwd)" -type f -inum '{}' -delete

Resources