bash - delete list of directories (including their contents) - linux

The below script gives this error:
rm: illegal option -- 4
rm: illegal option -- 5
rm: illegal option -- 4
rm: illegal option -- 3
rm: illegal option -- 2
the script:
#!/bin/bash
keep_no=$1+1
cd "/mydirec/"
rm -rf `ls | sort -nr | tail +$keep_no`
I would like the script to accept an argument (the number of directories to keep) and then remove all directories (including the files they contain) except for the top N, ordering by the numeric directory names in descending order.
ie if /mydirec/ contains these direc names:
53
92
8
152
77
and the script is called like: bash del.sh 2
then /mydirec/ should contain these direcs (as it removes those that aren't the top 2 in desc order):
152
92
Can someone please help with the syntax?

Should read:
rm -rf `ls | sort -nr | tail -n +$keep_no`
But it is good practice not to parse ls output. Use find instead.
#!/bin/bash
keep_no=$(( $1+1 ))
directory="./mydirec/"
cd "$directory"
rm -rf `find . -maxdepth 1 -mindepth 1 -type d -printf '%f\n'| sort -nr | tail -n +$keep_no`
cd -
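A quick way to sanity-check the fixed pipeline is to run it against a throwaway directory (directory names taken from the question; mktemp and GNU find's -printf are assumed to be available):

```shell
#!/bin/bash
# Throwaway demo of the keep-N logic from the answer above.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir 53 92 8 152 77

keep=2
keep_no=$(( keep + 1 ))   # start deleting from line 3 of the sorted list
# GNU find prints bare directory names; sort them numerically descending,
# then everything from line $keep_no onward is removed.
rm -rf $(find . -maxdepth 1 -mindepth 1 -type d -printf '%f\n' | sort -nr | tail -n +$keep_no)

ls
```

Only 152 and 92 survive, matching the expected result from the question.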

#!/bin/bash
if [[ -z "$1" ]]; then
echo "syntax is..."
exit 1
fi
keep_no=$(( $1 + 1 ))
cd "/mydirec/"
IFS='
'; # internal field separator: just a newline between the single quotes
echo rm -rf $(ls | sort -nr | tail -n +$keep_no)
Verify the output of the script manually, then execute the script through sh:
./your_script.sh | sh -x

If you want to keep a given number of directories, you first need to calculate the total number of directories. The xargs utility is also a more convenient way to pass a list of arguments to rm.
#!/bin/bash
dir="/yourdir"
total_no=`ls | wc -l`
keep_no=$(( $total_no - $1 ))
ls | sort -nr | tail -n $keep_no | xargs rm -rf

Related

Linux: what's a fast way to find all duplicate files in a directory?

I have a directory with many subdirs and about 7000+ files in total. I need to find all duplicates of these files. For any given file, its duplicates might be scattered around various subdirs and may or may not have the same file name. A duplicate is a file for which the diff command returns exit code 0.
The simplest thing to do is to run a double loop over all the files in the directory tree. But that's 7000^2 sequential diffs and not very efficient:
for f in `find /path/to/root/folder -type f`
do
for g in `find /path/to/root/folder -type f`
do
if [ "$f" = "$g" ]
then
continue
fi
diff "$f" "$g" > /dev/null
if [ $? -eq 0 ]
then
echo "$f" MATCHES "$g"
fi
done
done
Is there a more efficient way to do it?
On Debian 11:
% mkdir files; (cd files; echo "one" > 1; echo "two" > 2a; cp 2a 2b)
% find files/ -type f -print0 | xargs -0 md5sum | tee listing.txt | \
awk '{print $1}' | sort | uniq -c | awk '$1>1 {print $2}' > dups.txt
% grep -f dups.txt listing.txt
c193497a1a06b2c72230e6146ff47080 files/2a
c193497a1a06b2c72230e6146ff47080 files/2b
Find and print all files null terminated (-print0).
Use xargs to md5sum them.
Save a copy of the sums and filenames in "listing.txt" file.
Grab the sum and pass to sort then uniq -c to count, saving into the "dups.txt" file.
Use awk to list duplicates, then grep to find the sum and filename.
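Hashing every file can still be wasteful on large trees. A common refinement (a sketch, assuming GNU find, sort and uniq, and filenames without whitespace) is to use file size as a cheap pre-filter and only hash sizes that occur more than once:

```shell
#!/bin/bash
# Sketch: hash only files whose size occurs more than once; equal size is a
# necessary condition for being duplicates, so it is a cheap pre-filter.
tmp=$(mktemp -d)
mkdir "$tmp/files"
echo "one" > "$tmp/files/1"; echo "two" > "$tmp/files/2a"
cp "$tmp/files/2a" "$tmp/files/2b"; echo "something longer" > "$tmp/files/big"

# size<TAB>path for every file; candidate sizes are those occurring twice or more
find "$tmp/files" -type f -printf '%s\t%p\n' | sort -n > "$tmp/sizes.txt"
cut -f1 "$tmp/sizes.txt" | uniq -d > "$tmp/dupsizes.txt"
# hash only the candidates, then keep groups sharing the same md5
dups=$(awk -F'\t' 'NR==FNR{want[$1]=1;next} $1 in want{print $2}' \
        "$tmp/dupsizes.txt" "$tmp/sizes.txt" |
       xargs md5sum | sort | uniq -D -w32)
echo "$dups"
```

Here only 2a and 2b end up in the duplicate report; the unique files are never hashed at all (big) or are eliminated by the hash comparison (1).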

How to delete files similar to items in a string

I want to delete files in the current folder with the following pattern.
0_something.sql.tar
I have a string provided which contains numbers
number_string="0,1,2,3,4"
How can I delete any files not included in the number_string while also keeping to the x_x.sql.tar pattern?
For example, I have these files:
0_something.sql.tar
2_something.sql.tar
4_something.sql.tar
15_something.sql.tar
Based on this logic, and the numbers in the number string, I should only remove 15_something.sql.tar because:
It follows the x_something.sql.tar pattern
Its number is not in the number string
This might help you out:
s="0,1,2,3,4"
s=",${s},"
for f in *.sql.tar; do
n="${f%_*}"
[ "${n//[0-9]}" ] && continue
[ "$s" == "${s/,${n},/}" ] && echo rm -- "$f"
done
Remove the echo if this answer pleases you
What this is doing is the following:
convert your number_string s into a string which is fully comma-separated and
also starts and ends with a comma (s=",0,1,2,3,4,"). This allows us to search for entries like ,5,
loop over all files matched by the glob *.sql.tar
n="${f%_*}": extract the substring before the last underscore
[ "${n//[0-9]}" ] && continue: check that the substring consists only of digits; if not, skip the file and move to the next one.
substitute the number in the number_string (with commas), if the substring does not change, it implies we should not keep the file
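A self-contained check of that loop (file names from the question; the echo replaced by a real rm inside a throwaway directory):

```shell
#!/bin/bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch 0_something.sql.tar 2_something.sql.tar 4_something.sql.tar 15_something.sql.tar

s="0,1,2,3,4"
s=",${s},"                            # ",0,1,2,3,4," so we can search for ",15,"
for f in *.sql.tar; do
  n="${f%_*}"                         # number before the underscore
  [ "${n//[0-9]}" ] && continue       # not purely numeric: leave the file alone
  [ "$s" == "${s/,${n},/}" ] && rm -- "$f"   # number absent from the list: delete
done
ls
```

Only 15_something.sql.tar is removed; 0, 2 and 4 survive because deleting ",0," (etc.) from the wrapped string changes it.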
# Get the unmatched numbers from the second stream
# ie. files to be removed
join -v2 -o2.2 <(
# output sorted numbers on separate lines
sort <<<${number_string//,/$'\n'}
) <(
# find all files named in such a way
# and print filename, tab and path separated by newlines
find . -name '[0-9]*_something.sql.tar' -printf "%f\t%p\n" |
# extract numbers from filenames only
sed 's/\([0-9]*\)[^\t]*/\1/' |
# sort for join
sort
) |
# pass the input to xargs
# remove echo to really remove files
xargs -d '\n' echo rm
Tested on repl
$IFS can help here.
( IFS=,; for n in $number_string; do echo rm $n\_something.sql.tar; done; )
The parens run the command in a subshell so the reassignment of IFS is scoped.
Setting it to a comma lets the command parser split the string into discrete numbers for you and loop over them.
If that gives you the right list of commands you want to execute, just take out the echo. :)
UPDATE
OH! I see that now. Sorry, my bad, lol...
Well then, let's try a totally different approach. :)
Extended Globbing is likely what you need.
shopt -s extglob # turn extended globbing on
echo rm !(${number_string//,/\|})_something.sql.tar
That'll show you the command that would be executed. If you're satisfied, take the echo off. :)
This skips the need for a brute-force loop.
Explanation -
Once extglob is on, !(...) means "anything that does NOT match any of these patterns."
${number_string//,/\|} replaces all commas in the string with pipe separators, creating a match pattern for the extended glob.
Thus, !(${number_string//,/\|}) means anything NOT matching one of those patterns; !(${number_string//,/\|})_something.sql.tar then means "anything that starts with something NOT one of these patterns, followed by this string."
I created these:
$: printf "%s\n" *_something.sql.tar
0_something.sql.tar
1_something.sql.tar
2_something.sql.tar
3_something.sql.tar
4_something.sql.tar
5_something.sql.tar
6_something.sql.tar
7_something.sql.tar
8_something.sql.tar
9_something.sql.tar
then after setting extglob and using the above value for $number_string, I get this:
$: echo !(${number_string//,/\|})_something.sql.tar
5_something.sql.tar 6_something.sql.tar 7_something.sql.tar 8_something.sql.tar 9_something.sql.tar
Be careful about quoting, though. You can quote it to see the pattern itself, but then it matches nothing.
$: echo "!(${number_string//,/\|})_something.sql.tar"
!(0|1|2|3|4)_something.sql.tar
if you prefer the loop...
for f in *_something.sql.tar                    # iterating over all these
do case ",$number_string," in                   # list wrapped in commas
   *",${f%_something.sql.tar},"*) continue ;;   # number is in the list: keep
   *) rm "$f" ;;                                # delete nonmatches
   esac
done
Write a script to do the matching, and remove those names that do not match. For example:
$ rm -rf foo
$ mkdir foo
$ cd foo
$ touch {2,4,6,8}.tar
$ echo "$number_string" | tr , \\n | sed 's/$/.tar/' > match-list
$ find . -type f -exec sh -c 'echo $1 | grep -f match-list -v -q' _ {} \; -print
./6.tar
./8.tar
./match-list
Replace -print with -delete to actually unlink the names. Note that this will cause problems since match-list will probably get deleted midway through and no longer exist for future matches, so you'll want to modify it a bit. Perhaps:
find . -type f -not -name match-list -name '*.tar' -exec sh -c 'echo $1 | grep -f match-list -v -q' _ {} \; -delete
In this case, there's no need to explicitly exclude 'match-list' since it will not match the -name '*.tar' primitive, but is included here for completeness.
I have borrowed from some previous answers, but credit is given and the resulting script is nice
$ ls -l
total 4
-rwxr-xr-x 1 boffi boffi 355 Jul 27 10:58 rm_tars_except
$ cat rm_tars_except
#!/usr/bin/env bash
dont_rm="$1"
# https://stackoverflow.com/a/10586169/2749397
IFS=',' read -r -a dont_rm_a <<< "$dont_rm"
for tarfile in ?.tar ; do
digit=$( basename "$tarfile" .tar )
# https://stackoverflow.com/a/15394738/2749397
[[ " ${dont_rm_a[@]} " =~ " ${digit} " ]] && \
echo "# Keep $tarfile" || \
echo "rm $tarfile"
done
$ touch 1.tar 3.tar 5.tar 7.tar
$ ./rm_tars_except 3,5
rm 1.tar
# Keep 3.tar
# Keep 5.tar
rm 7.tar
$ ./rm_tars_except 3,5 | sh
$ ls -l
total 4
-rw-r--r-- 1 boffi boffi 0 Jul 27 11:00 3.tar
-rw-r--r-- 1 boffi boffi 0 Jul 27 11:00 5.tar
-rwxr-xr-x 1 boffi boffi 355 Jul 27 10:58 rm_tars_except
$
If we can remove the restrictions on the "keep info" presented in a comma separated string then the script can be significantly simplified
#!/usr/bin/env bash
for tarfile in ?.tar ; do
digit=$( basename "$tarfile" .tar )
# https://stackoverflow.com/a/15394738/2749397
[[ " ${@} " =~ " ${digit} " ]] && \
echo "# Keep $tarfile" || \
echo "rm $tarfile"
done
that, of course, should be called like this ./rm_tars_except 3 5 | sh
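The whitespace-padded =~ membership test that both variants rely on can be exercised on its own (the function name here is invented for the demo):

```shell
#!/bin/bash
# in_list NEEDLE WORD... -> success iff NEEDLE equals one of the WORDs.
# Padding both sides with spaces prevents "3" from matching inside "13".
in_list() {
  local needle=$1; shift
  [[ " $* " =~ " $needle " ]]
}
in_list 3 3 5 && echo "3 is kept"
in_list 13 3 5 || echo "13 is removed"
```

The quoted right-hand side of =~ makes it a literal substring match, which is exactly what the padding trick needs.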
find . -type f -name '*_something.sql.tar' | grep -v "/\(<the series joined with \|>\)_" | xargs rm -f
example:-
find . -type f -name '*_something.sql.tar' | grep -v "/\(0\|1\|2\|3\|4\)_" | xargs rm -f
(grep -v keeps only the files whose leading number is not in the series, which are the ones to delete; anchoring between / and _ stops 1 from matching 15_.)

bash - loop through subdirectories, cat files and rename with directory name

I have a folder structure like this ...
data/
---B1/
name_x_1.gz
name_y_1.gz
name_z_2.gz
name_p_2.gz
---C1
name_s_1.gz
name_t_1.gz
name_u_2.gz
name_v_2.gz
I need to go in to each subdirectory (e.g. B1) and perform the following:
cat *_1.gz > B1_1.gz
cat *_2.gz > B1_2.gz
I'm having problems with the file naming part. I can get in directories using the following:
for d in */; do
cat *_1.gz > $d_1.gz
cat *_2.gz > $d_2.gz
done
However I get an error that $d is a directory -- how do I strip the name to create the concatenated filename?
Thanks
Taking your question verbatim: if you have a variable d which you know ends in / (as in your example), you can get its value with the last character stripped by writing ${d:0:-1} (i.e., the substring starting at the beginning, up to but excluding the last character).
Of course in your case, I would rather write the loop as
for d in *; do
which already creates the names without a trailing slash. But this is still probably not what you want, because d would take the names of the entries in the directory you have cd'ed to, whereas you want the name of the directory itself. You can obtain that with, for instance, $(basename "$PWD"), which turns your loop into:
cd B1
prefix=$(basename "$PWD") # This sets prefix to B1
for f in *
do
# Since your original code indicates that you want to create a *copy* of the file
# with a new name, I do the same here.
cp -v "$f" "${prefix}_$f"
done
You can also use cat, as in your original solution, if you prefer.
If you're using bash, you can use parameter expansion and do everything natively in the shell, without forking a sub-shell to another process. The ${var##pattern} expansion itself is POSIX-compliant.
#!/bin/bash
for dir in data/*; do
cat "$dir/"*_1.gz > "$dir/${dir##*/}_1.gz"
cat "$dir/"*_2.gz > "$dir/${dir##*/}_2.gz"
done
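The two expansions doing the heavy lifting here can be checked in isolation (paths invented for the demo):

```shell
#!/bin/bash
dir="data/B1"
base=${dir##*/}     # remove everything up to and including the last "/" -> B1
echo "$base"

d="B1/"             # glob results from "for d in */" carry a trailing slash
trimmed=${d%/}      # remove a trailing "/" -> B1
echo "$trimmed"
```

`##*/` deletes the longest prefix matching `*/`, and `%/` deletes the shortest suffix matching `/`, so combining them turns any `path/name/` into `name`.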
Sure, just descend into the directory.
# assuming PWD = data/
for d in */; do
(
cd "$d"
cat *_1.gz > "$(basename "$d")"_1.gz
cat *_2.gz > "$(basename "$d")"_2.gz
)
done
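A throwaway run of this loop (plain text files stand in for real gzip members here; since concatenating gzip streams yields a valid gzip file, the same pattern applies to real .gz data):

```shell
#!/bin/bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir B1
printf 'a\n' > B1/name_x_1.gz       # plain files standing in for gzip data
printf 'b\n' > B1/name_y_1.gz
printf 'c\n' > B1/name_z_2.gz

for d in */; do
  (
    cd "$d" || exit 1
    cat *_1.gz > "$(basename "$d")"_1.gz
    cat *_2.gz > "$(basename "$d")"_2.gz
  )
done
cat B1/B1_1.gz
```

Note the glob *_1.gz is expanded before the redirection creates B1_1.gz, so the output file never feeds back into its own cat.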
how do I strip the name to create the concatenated filename?
The simplest and most portable is with basename.
This requires Ed, which should hopefully be present on your machine. If not, I trust your distribution will have a package for it.
#!/bin/sh
cat >> edprint+.txt << EOF
1p
q
EOF
cat >> edpop+.txt << EOF
1d
wq
EOF
b1="${PWD}/data/B1"
c1="${PWD}/data/C1"
find "${b1}" -maxdepth 1 -type f > b1stack
find "${c1}" -maxdepth 1 -type f > c1stack
while [ $(wc -l b1stack | cut -d' ' -f1) -gt 0 ]
do
b1line=$(ed -s b1stack < edprint+.txt)
b1name=$(basename "${b1line}")
b1suffix=$(echo "${b1name}" | cut -d'_' -f3)
b1fixed="B1_${b1suffix}"
mv -v "${b1line}" "${b1}/${b1fixed}"
ed -s b1stack < edpop+.txt
done
while [ $(wc -l c1stack | cut -d' ' -f1) -gt 0 ]
do
c1line=$(ed -s c1stack < edprint+.txt)
c1name=$(basename "${c1line}")
c1suffix=$(echo "${c1name}" | cut -d'_' -f3)
c1fixed="C1_${c1suffix}"
mv -v "${c1line}" "${c1}/${c1fixed}"
ed -s c1stack < edpop+.txt
done
rm -v ./edprint+.txt
rm -v ./edpop+.txt
rm -v ./b1stack
rm -v ./c1stack

assigning files in a directory to sub-directories

I have 1000s of files in a directory and I want to be able to divide them into sub-directories, with each sub-directory containing a specific number of files. I don't care which files go into which directories, just as long as each contains a specific number. All the file names have a common ending (e.g. .txt) but what comes before varies.
Anyone know an easy way to do this.
Assuming you only have files ending in *.txt, no hidden files and no directories:
#!/bin/bash
shopt -s nullglob
maxf=42
files=( *.txt )
for ((i=0;maxf*i<${#files[@]};++i)); do
s=subdir$i
mkdir -p "$s"
mv -t "$s" -- "${files[@]:i*maxf:maxf}"
done
This will create directories subdirX with X an integer starting from 0, and will put 42 files in each directory.
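Here is the same loop scaled down to 10 demo files in batches of 4, which makes the slicing easy to verify (GNU mv's -t option is assumed):

```shell
#!/bin/bash
shopt -s nullglob
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch f{01..10}.txt        # ten demo files

maxf=4
files=( *.txt )
for ((i=0; maxf*i<${#files[@]}; ++i)); do
  s=subdir$i
  mkdir -p "$s"
  # "${files[@]:i*maxf:maxf}" is the i-th slice of at most maxf names
  mv -t "$s" -- "${files[@]:i*maxf:maxf}"
done
ls -d subdir*
```

Ten files in batches of four give subdir0 and subdir1 with four files each and subdir2 with the remaining two.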
You can tweak the thing to have padded zeroes for X:
#!/bin/bash
shopt -s nullglob
files=( *.txt )
maxf=42
((l=${#files[@]}/maxf))
p=${#l}
for ((i=0;maxf*i<${#files[@]};++i)); do
printf -v s "subdir%0${p}d" "$i"
mkdir -p "$s"
mv -t "$s" -- "${files[@]:i*maxf:maxf}"
done
max_per_subdir=1000
start=1
while [ -e $(printf %03d $start) ]; do
start=$((start + 1))
done
find -maxdepth 1 -type f ! -name '.*' -name '*.txt' -print0 \
| xargs -0 -n $max_per_subdir echo \
| while read -a files; do
subdir=$(printf %03d $start)
mkdir $subdir || exit 1
mv "${files[@]}" $subdir/ || exit 1
start=$((start + 1))
done
How about
find . -maxdepth 1 -name '*.txt' -print0 | xargs -0 -n 100 | xargs -I {} echo 'd=$(md5sum <<< "{}" | cut -d" " -f1); mkdir -p "$d"; cp {} "$d"' | bash
This will create several directories, each containing up to 100 files. The name of each created directory is an md5 hash of the list of filenames it contains. (It assumes the filenames contain no whitespace, quotes or other shell metacharacters.)

File with the most lines in a directory NOT bytes

I'm trying to wc -l an entire directory and then display the filename in an echo with the number of lines.
To add to my frustration, the directory has to come from a passed argument. So without looking stupid, can someone first tell me why a simple wc -l $1 doesn't give me the line count for the directory I type in the argument? I know I'm not understanding it completely.
On top of that I need validation too, if the argument given is not a directory or there is more than one argument.
wc works on files rather than directories so, if you want the word count on all files in the directory, you would start with:
wc -l $1/*
With various gyrations to get rid of the total, sort it and extract only the largest, you could end up with something like (split across multiple lines for readability but should be entered on a single line):
pax> wc -l $1/* 2>/dev/null
| grep -v ' total$'
| sort -n -k1
| tail -1l
2892 target_dir/big_honkin_file.txt
As to the validation, you can check the number of parameters passed to your script with something like:
if [[ $# -ne 1 ]] ; then
echo 'Whoa! Wrong parameter count'
exit 1
fi
and you can check if it's a directory with:
if [[ ! -d $1 ]] ; then
echo 'Whoa!' "[$1]" 'is not a directory'
exit 1
fi
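Putting the validation and the pipeline together (a sketch; the function name and test files are invented for the demo):

```shell
#!/bin/bash
# biggest DIR -> print the "count path" line for the file with the most lines.
biggest() {
  if [[ $# -ne 1 ]]; then
    echo 'Whoa! Wrong parameter count' >&2; return 1
  fi
  if [[ ! -d $1 ]]; then
    echo 'Whoa!' "[$1]" 'is not a directory' >&2; return 1
  fi
  wc -l "$1"/* 2>/dev/null | grep -v ' total$' | sort -n -k1 | tail -n 1
}

tmp=$(mktemp -d)
printf 'x\n' > "$tmp/small.txt"
printf 'x\nx\nx\n' > "$tmp/big.txt"
result=$(biggest "$tmp")
echo "$result"
```

With one-line and three-line test files, the function reports big.txt, and both bad-argument cases fail with a message on stderr.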
Is this what you want?
> find ./test1/ -type f|xargs wc -l
1 ./test1/firstSession_cnaiErrorFile.txt
77 ./test1/firstSession_cnaiReportFile.txt
14950 ./test1/exp.txt
1 ./test1/test1_cnaExitValue.txt
15029 total
so your directory which is the argument should go here:
find $your_complete_directory_path/ -type f|xargs wc -l
I'm trying to to wc -l an entire directory and then display the
filename in an echo with the number of lines.
You can do a find on the directory and use -exec option to trigger wc -l. Something like this:
$ find ~/Temp/perl/temp/ -exec wc -l '{}' \;
wc: /Volumes/Data/jaypalsingh/Temp/perl/temp/: read: Is a directory
11 /Volumes/Data/jaypalsingh/Temp/perl/temp//accessor1.plx
25 /Volumes/Data/jaypalsingh/Temp/perl/temp//autoincrement.pm
12 /Volumes/Data/jaypalsingh/Temp/perl/temp//bless1.plx
14 /Volumes/Data/jaypalsingh/Temp/perl/temp//bless2.plx
22 /Volumes/Data/jaypalsingh/Temp/perl/temp//classatr1.plx
27 /Volumes/Data/jaypalsingh/Temp/perl/temp//classatr2.plx
7 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee1.pm
18 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee2.pm
26 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee3.pm
12 /Volumes/Data/jaypalsingh/Temp/perl/temp//ftp.plx
14 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit1.plx
16 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit2.plx
24 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit3.plx
33 /Volumes/Data/jaypalsingh/Temp/perl/temp//persisthash.pm
Nice question!
I saw the answers, and some are pretty good. The find ... | xargs one is my favourite. It could be simplified further using the find ... -exec wc -l {} + syntax. But there is a problem: when the command-line buffer fills up, wc -l ... is invoked more than once, and each invocation prints its own <number> total line. As wc has no option to disable this feature, wc has to be reimplemented. Filtering the total lines out with grep is not nice:
So my complete answer is
#!/usr/bin/bash
[ $# -ne 1 ] && echo "Bad number of args">&2 && exit 1
[ ! -d "$1" ] && echo "Not dir">&2 && exit 1
find "$1" -type f -exec awk '{++n[FILENAME]}END{for(i in n) printf "%8d %s\n",n[i],i}' {} +
Or using less temporary space, but a little bit larger code in awk:
find "$1" -type f -exec awk 'function pr(){printf "%8d %s\n",n,f}FNR==1{f&&pr();n=0;f=FILENAME}{++n}END{pr()}' {} +
Misc
If it should not be called for subdirectories then add -maxdepth 1 before -type to find.
It is pretty fast. I was afraid it would be much slower than the find ... wc + version, but for a directory containing 14770 files (in several subdirs) the wc version ran in 3.8 sec and the awk version in 5.2 sec.
awk and wc treat a final line without a trailing \n differently: wc does not count it, awk does. I prefer to count it, as awk does.
It does not print empty files.
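The wc-vs-awk difference on a final unterminated line is easy to demonstrate:

```shell
#!/bin/bash
tmp=$(mktemp -d)
printf 'a\nb' > "$tmp/noeol"        # two lines of text, no final newline

wc -l < "$tmp/noeol"                # 1: wc counts newline characters
awk 'END{print NR}' "$tmp/noeol"    # 2: awk counts records
```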
To find the file with most lines in the current directory and its subdirectories, with zsh:
lines() REPLY=$(wc -l < "$REPLY")
wc -l -- **/*(D.nO+lines[1])
That defines a lines function which is going to be used as a glob sorting function that returns in $REPLY the number of lines of the file whose path is given in $REPLY.
Then we use zsh's recursive globbing **/* to find regular files (.), numerically (n) reverse sorted (O) with the lines function (+lines), and select the first one [1]. (D to include dotfiles and traverse dotdirs).
Doing it with standard utilities is a bit tricky if you don't want to make assumptions on what characters file names may contain (like newline, space...). With GNU tools as found on most Linux distributions, it's a bit easier as they can deal with NUL terminated lines:
find . -type f -exec sh -c '
for file do
size=$(wc -l < "$file") &&
printf "%s\0" "$size:$file"
done' sh {} + |
tr '\n\0' '\0\n' |
sort -rn |
head -n1 |
tr '\0' '\n'
Or with zsh or GNU bash syntax:
biggest= max=-1
find . -type f -print0 |
{
while IFS= read -rd '' file; do
size=$(wc -l < "$file") &&
((size > max)) &&
max=$size biggest=$file
done
[[ -n $biggest ]] && printf '%s\n' "$max: $biggest"
}
Here's one that works for me with the git bash (mingw32) under windows:
find . -type f -print0| xargs -0 wc -l
This will list the files and line counts in the current directory and sub dirs. You can also direct the output to a text file and import it into Excel if needed:
find . -type f -print0| xargs -0 wc -l > fileListingWithLineCount.txt
