Iterate over multiple files and remove those that contain a specific character fewer than X times - Linux

New to shell. I have more than 10 thousand files and I have to delete the files that contain the "<" character fewer than 10 times.
wc -l * 2>&1 | while read -r num file; do ((num < 10)) && echo rm "$file"; done - this removes files with fewer than 10 lines, but how do I count the "<" character instead of lines?

With GNU grep, bash and GNU xargs:
#!/bin/bash
grep -cZ '<' * |
while IFS='' read -r -d '' file && read count
do
(( count < 10 )) && printf '%s\0' "$file"
done |
xargs -0r rm
Explanations
grep -cZ outputs a stream of file \0 count \n records.
You process it with a while loop that reads the file (using a NUL-byte delimiter) and the count (using a newline delimiter).
You do your filtering logic and output the files that you want to delete (in the form of NUL-delimited records).
Finally, xargs -0r rm deletes those files.
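Before running this against 10,000 files, you can do a dry run by swapping the final rm for printf; this is the same pipeline, only the last command differs:
grep -cZ '<' * |
while IFS='' read -r -d '' file && read count
do
(( count < 10 )) && printf '%s\0' "$file"
done |
xargs -0r printf 'would remove: %s\n'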
Here's an alternative with GNU awk and xargs:
awk -v n=10 '
    FNR == 1 {
        count = 0
    }
    /</ && ++count >= n {
        nextfile
    }
    ENDFILE {
        if (count < n)
            printf "%s%c", FILENAME, 0
    }
' * |
xargs -0r rm

Using GNU grep (for the -m option, to make it a bit more efficient):
#!/bin/bash
for f in *; do
(( $(grep -Fc -m10 \< "$f") < 10 )) && echo rm "$f"
done
Drop the echo if output looks fine.

Related

Filtering a list by 5 files per directory

So I have a list of files inside a tree of folders:
/home/user/Scripts/example/tmp/folder2/2
/home/user/Scripts/example/tmp/folder2/3
/home/user/Scripts/example/tmp/folder2/4
/home/user/Scripts/example/tmp/folder2/5
/home/user/Scripts/example/tmp/folder2/6
/home/user/Scripts/example/tmp/folder2/7
/home/user/Scripts/example/tmp/folder2/8
/home/user/Scripts/example/tmp/folder2/9
/home/user/Scripts/example/tmp/folder2/10
/home/user/Scripts/example/tmp/other_folder/files/1
/home/user/Scripts/example/tmp/other_folder/files/2
/home/user/Scripts/example/tmp/other_folder/files/3
/home/user/Scripts/example/tmp/other_folder/files/4
/home/user/Scripts/example/tmp/other_folder/files/5
/home/user/Scripts/example/tmp/other_folder/files/6
/home/user/Scripts/example/tmp/other_folder/files/7
/home/user/Scripts/example/tmp/other_folder/files/8
/home/user/Scripts/example/tmp/other_folder/files/9
/home/user/Scripts/example/tmp/other_folder/files/10
/home/user/Scripts/example/tmp/test/example/1
/home/user/Scripts/example/tmp/test/example/2
/home/user/Scripts/example/tmp/test/example/3
/home/user/Scripts/example/tmp/test/example/4
/home/user/Scripts/example/tmp/test/example/5
/home/user/Scripts/example/tmp/test/example/6
/home/user/Scripts/example/tmp/test/example/7
/home/user/Scripts/example/tmp/test/example/8
/home/user/Scripts/example/tmp/test/example/9
/home/user/Scripts/example/tmp/test/example/10
/home/user/Scripts/example/tmp/test/other/1
/home/user/Scripts/example/tmp/test/other/2
/home/user/Scripts/example/tmp/test/other/3
/home/user/Scripts/example/tmp/test/other/4
/home/user/Scripts/example/tmp/test/other/5
/home/user/Scripts/example/tmp/test/other/6
/home/user/Scripts/example/tmp/test/other/7
/home/user/Scripts/example/tmp/test/other/8
/home/user/Scripts/example/tmp/test/other/9
/home/user/Scripts/example/tmp/test/other/10
I want to filter this list down so that I only keep the highest 5 numbers for each directory.
Any ideas?
Preferably in bash/shell.
Expected output (small sample, since SO complains about too much code):
/home/user/Scripts/example/tmp/test/example/6
/home/user/Scripts/example/tmp/test/example/7
/home/user/Scripts/example/tmp/test/example/8
/home/user/Scripts/example/tmp/test/example/9
/home/user/Scripts/example/tmp/test/example/10
/home/user/Scripts/example/tmp/test/other/6
/home/user/Scripts/example/tmp/test/other/7
/home/user/Scripts/example/tmp/test/other/8
/home/user/Scripts/example/tmp/test/other/9
/home/user/Scripts/example/tmp/test/other/10
Thanks
Edit: using for i in $(for i in $(dirname $(find $(pwd) -type f -name "*[0-9]*" | sort -V) | uniq) ;do ls $i | sort -V |tail -n 5 ; done) ; do readlink -f $i ; done works for a small sample size. However, expanding said sample appears to be too long for dirname.
Assuming your input data is sorted.
Try:
awk -F'/[^/]*$' '{if (NR==1 || prev_dir == $1) {i=i+1} else {i=1}; if ( i<=5){ prev_dir=$1 ; print $0}; }'
Explanation:
'/[^/]*$' <-- Set the field separator to a regex that strips the trailing file name, so the directory path becomes the first field.
if (NR==1 || prev_dir == $1) {i=i+1} else {i=1}; <-- Check whether the file is from the same directory as the previous one; if so, increment the counter, otherwise reset it to 1.
if ( i<=5){ prev_dir=$1 ; print $0}; }' <-- Print the first 5 records of each directory.
Demo:
$awk -F'/[^/]*$' '{if (NR==1 || prev_dir == $1) {i=i+1} else {i=1}; if ( i<=5){ prev_dir=$1 ; print $0 }; }' temp.txt
/home/user/Scripts/example/tmp/folder2/2
/home/user/Scripts/example/tmp/folder2/3
/home/user/Scripts/example/tmp/folder2/4
/home/user/Scripts/example/tmp/folder2/5
/home/user/Scripts/example/tmp/folder2/6
/home/user/Scripts/example/tmp/other_folder/files/1
/home/user/Scripts/example/tmp/other_folder/files/2
/home/user/Scripts/example/tmp/other_folder/files/3
/home/user/Scripts/example/tmp/other_folder/files/4
/home/user/Scripts/example/tmp/other_folder/files/5
$cat temp.txt
/home/user/Scripts/example/tmp/folder2/2
/home/user/Scripts/example/tmp/folder2/3
/home/user/Scripts/example/tmp/folder2/4
/home/user/Scripts/example/tmp/folder2/5
/home/user/Scripts/example/tmp/folder2/6
/home/user/Scripts/example/tmp/folder2/7
/home/user/Scripts/example/tmp/folder2/8
/home/user/Scripts/example/tmp/folder2/9
/home/user/Scripts/example/tmp/folder2/10
/home/user/Scripts/example/tmp/other_folder/files/1
/home/user/Scripts/example/tmp/other_folder/files/2
/home/user/Scripts/example/tmp/other_folder/files/3
/home/user/Scripts/example/tmp/other_folder/files/4
/home/user/Scripts/example/tmp/other_folder/files/5
/home/user/Scripts/example/tmp/other_folder/files/6
/home/user/Scripts/example/tmp/other_folder/files/7
/home/user/Scripts/example/tmp/other_folder/files/8
/home/user/Scripts/example/tmp/other_folder/files/9
/home/user/Scripts/example/tmp/other_folder/files/10
$
Here is an implementation in plain bash:
#!/bin/bash
prevdir=
while read -r line; do
dir=${line%/*}
[[ $dir == "$prevdir" ]] || { n=0; prevdir=$dir; }
((n++ < 5)) && echo "$line"
done
You can use it like:
./script < file.list # If file.list already sorted by a reverse version sort
or,
sort -rV file.list | ./script # If the file.list is not sorted
or,
find /home/user/Scripts -type f | sort -rV | ./script
Also, you may want to append | tac to the pipelines above so the selected files come out in ascending order, as in your expected output.
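For the sample tree in the question, the whole thing would look something like this (a sketch, using the path from your list):
find /home/user/Scripts/example/tmp -type f | sort -rV | ./script | tac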

Rename files produced by split

I split a huge file and the output is several files whose names start with the character x.
I want to rename them so that they form a list sorted by name, like below:
part-1.gz
part-2.gz
part-3.gz ...
I tried the commands below:
for (( i = 1; i <= 3; i++ )) ;do for f in `ls -l | awk '{print $9}' | grep '^x'`; do mv $f part-$i.gz ;done ; done;
for f in `ls -l | awk '{print $9}' | grep '^x'`; do for i in 1 .. 3 ; do mv -- "$f" "${part-$i}.gz" ;done ; done;
for i in 1 .. 3 ;do for f in `ls -l | awk '{print $9}' | grep '^x'`; do mv -- "$f" "${part-$i}.gz" ;done ; done;
for f in `ls -l | awk '{print $9}' | grep '^x'`; do mv -- "$f" "${f%}.gz" ;done
Tip: don't do ls -l if you only need the file names. Even better, don't use ls at all, just use the shell's globbing ability. x* expands to all file names starting with x.
Here's a way to do it:
i=1; for f in x*; do mv $f $(printf 'part-%d.gz' $i); ((i++)); done
This initializes i to 1, and then loops over all file names starting with x in alphabetical order, assigning each file name in turn to the variable f. Inside the loop, it renames $f to $(printf 'part-%d.gz' $i), where the printf command replaces %d with the current value of i. You might want something like %02d if you need to prefix the number with zeros. Finally, still inside the loop, it increments i so that the next file receives the next number.
Note that none of this is safe if the input file names contain spaces, but yours don't.
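If you want zero-padded numbers and quoting that also survives unusual file names, a variant along these lines should work (a sketch; the %03d width is arbitrary):
i=1; for f in x*; do mv -- "$f" "$(printf 'part-%03d.gz' "$i")"; ((i++)); done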

Find file with largest number of lines in single directory

I'm trying to create a function that outputs only the file with the largest number of lines in a directory (and not any sub-directories). I'm being asked to make use of wc, but I don't really understand how to read each file individually and then sort the counts just to find the largest. Here is what I have so far:
#!/bin/bash
function sort {
[ $# -ne 1 ] && echo "Invalid number of arguments">&2 && exit 1;
[ ! -d "$1" ] && echo "Invalid input: not a directory">&2 && exit 1;
# Insert function here ;
}
# prompt if wanting current directory
# if yes
# sort $PWD
# if no
#sort $directory
This solution is almost pure Bash (wc is the only external command used):
shopt -s dotglob # Include filenames with initial '.' in globs
shopt -s nullglob # Make globs produce nothing when nothing matches
dir=$1
maxlines=-1
maxfile=
for file in "$dir"/* ; do
[[ -f $file ]] || continue # Skip non-files
[[ -L $file ]] && continue # Skip symlinks
numlines=$(wc -l < "$file")
if (( numlines > maxlines )) ; then
maxfile=$file
maxlines=$numlines
fi
done
[[ -n "$maxfile" ]] && printf '%s\n' "$maxfile"
Remove the shopt -s dotglob if you don't want to process files whose names begin with a dot. Remove the [[ -L $file ]] && continue if you want to process symlinks to files.
This solution should handle all filenames (ones containing spaces, ones containing glob characters, ones beginning with '-', ones containing newlines, ...), but it runs wc for each file so it may be unacceptably slow compared to solutions that feed many files to wc at once if you need to handle directories that have large numbers of files.
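Since the assignment asks for a function, the same loop can be wrapped up like this (a minimal sketch; largest_file is just a placeholder name, and note that the shopt settings persist in the calling shell):
largest_file() {
    shopt -s dotglob nullglob
    local dir=${1:-.} file numlines maxlines=-1 maxfile=
    for file in "$dir"/*; do
        [[ -f $file && ! -L $file ]] || continue   # regular files only, skip symlinks
        numlines=$(wc -l < "$file")
        if (( numlines > maxlines )); then
            maxfile=$file
            maxlines=$numlines
        fi
    done
    [[ -n $maxfile ]] && printf '%s\n' "$maxfile"
}
largest_file /tmp    # or just largest_file for the current directory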
How about this:
wc -l * | sort -nr | head -2 | tail -1
wc -l counts lines (you get an error for directories, though), then sort in reverse order treating the first column as a number, then take the first two lines, then the second, as we need to skip over the total line.
wc -l * 2>/dev/null | sort -nr | head -2 | tail -1
The 2>/dev/null throws away all the errors, if you want a neater output.
Use a function like this:
my_custom_sort() {
for i in "${1+$1/}"*; do
[[ -f "$i" ]] && wc -l "$i"
done | sort -n | tail -n1 | cut -d" " -f2
}
And use it with or without a directory argument (in the latter case, it uses the current directory):
my_custom_sort /tmp
helloworld.txt

How do I get the number of lines for specific files in a large directory?

I have the command "find . -name '*.dmp' | xargs wc -l" to get the line counts of all the .dmp files in a directory. The dump files follow the naming convention "dump-10181.dmp", with the number being a unique incremental number.
How do I get the number of lines for only the files numbered 50 - 678?
Try the following:
seq 50 678 | xargs -I'{}' cat 'dump-{}.dmp' | wc -l
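If you want a per-file count (plus a grand total) rather than a single number, one option is to generate the file names and hand them all to wc at once; this is a sketch and assumes every number in the 50-678 range actually exists as a file:
seq 50 678 | sed 's/.*/dump-&.dmp/' | xargs wc -l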
Longer than other solutions but more general:
for f in *.dmp ; do
    n=${f##*-}; n=${n%.dmp}                          # extract the number between "-" and ".dmp"
    [[ "$n" = "" || "$n" = *[^0-9]* ]] && continue   # skip names that are not a plain number
    n=$((10#$n))                                     # force base 10 (leading zeros would otherwise mean octal)
    ((n >= 50 && n <= 678)) && cat "./$f"
done | wc -l

How to remove only the duplicate files (same cksum) under a directory

I built the following script to remove files that have the same cksum (or content).
The problem is that the script can remove files twice, as in the following example (output).
My goal is to remove only the duplicate file and not the source file.
SCRIPT OUTPUT:
Starting:
Same: /tmp/File_inventury.out /tmp/File_inventury.out.1
Remove: /tmp/File_inventury.out.1
Same: /tmp/File_inventury.out.1 /tmp/File_inventury.out
Remove: /tmp/File_inventury.out
Same: /tmp/File_inventury.out.2 /tmp/File_inventury.out.3
Remove: /tmp/File_inventury.out.3
Same: /tmp/File_inventury.out.3 /tmp/File_inventury.out.2
Remove: /tmp/File_inventury.out.2
Same: /tmp/File_inventury.out.4 /tmp/File_inventury.out
Remove: /tmp/File_inventury.out
Done.
MY SCRIPT:
#!/bin/bash
DIR="/tmp"
echo "Starting:"
for file1 in ${DIR}/File_inventury.out*; do
    for file2 in ${DIR}/File_inventury.out*; do
        if [ $file1 != $file2 ]; then
            diff "$file1" "$file2" 1>/dev/null
            STAT=$?
            if [ $STAT -eq 0 ]
            then
                echo "Same: $file1 $file2"
                echo "Remove: $file2"
                rm "$file1"
                break
            fi
        fi
    done
done
echo "Done."
In any case, I would like to hear about other options for removing files that have the same content or cksum (I actually need to remove only the duplicate file, not the primary file).
Please advise how to do that under Solaris OS (for example with a find one-liner, awk, sed, etc.).
This version should be more efficient. I was nervous about paste matching the correct rows, but it looks like POSIX specifies that globbing is sorted by default.
for i in *; do
date -u +%Y-%m-%dT%TZ -r "$i";
done > .stat; #store the last modification time in a sortable format
cksum * > .cksum; #store the cksum, size, and filename
paste .stat .cksum | #data for each file, 1 per row
sort | #sort by mtime so original comes first
awk '{
    if ($2 in f)
        system("rm -v " $4);   # rm if we have seen an occurrence of this cksum
    else
        f[$2]++                # count the first occurrence
}'
This should run in O(n * log(n)) time, reading each file only once.
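For reference, each row that paste hands to awk looks roughly like this (hypothetical values), which is why the awk program keys on $2 (the checksum) and removes $4 (the file name):
2024-05-01T12:00:00Z  1628294481 1024 File_inventury.out.1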
You can put this in a shell script as:
#!/bin/sh
for i in *; do
date -u +%Y-%m-%dT%TZ -r "$i";
done > .stat;
cksum * > .cksum;
paste .stat .cksum | sort | awk '{if($2 in f) system("rm -v " $4); else f[$2]++}';
rm .stat .cksum;
exit 0;
Or do it as a one-liner:
for i in *; do date -u +%Y-%m-%dT%TZ -r "$i"; done > .stat; cksum * > .cksum; paste .stat .cksum | sort | awk '{if($2 in f) system("rm -v " $4); else f[$2]++}'; rm .stat .cksum;
I used an array as an index map, so I think it is just O(n)?
#!/bin/bash
arr=()
dels=()
for f in $1; do                              # $1 is a glob pattern, expanded here
    read ck x fn <<< "$(cksum "$f")"         # cksum prints: checksum size filename
    if [[ -z ${arr[$ck]} ]]; then
        arr[$ck]=$fn
    else
        echo "Same: ${arr[$ck]} $fn"
        echo "Remove: $fn"
        rm "$fn"
    fi
done
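Usage would be something along these lines (a sketch; the script name is hypothetical, and the glob must be quoted so it expands inside the script rather than on the command line):
./remove_dups.sh '/tmp/File_inventury.out*'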
