I have a directory with many subdirs and about 7000+ files in total. What I need is to find all duplicates of all files. For any given file, its duplicates might be scattered around various subdirs and may or may not have the same file name. A duplicate is a file for which the diff command returns exit code 0.
The simplest thing to do is to run a double loop over all the files in the directory tree. But that's 7000^2 sequential diffs, which is not very efficient:
for f in `find /path/to/root/folder -type f`
do
    for g in `find /path/to/root/folder -type f`
    do
        if [ "$f" = "$g" ]
        then
            continue
        fi
        diff "$f" "$g" > /dev/null
        if [ $? -eq 0 ]
        then
            echo "$f" MATCHES "$g"
        fi
    done
done
Is there a more efficient way to do it?
On Debian 11:
% mkdir files; (cd files; echo "one" > 1; echo "two" > 2a; cp 2a 2b)
% find files/ -type f -print0 | xargs -0 md5sum | tee listing.txt | \
awk '{print $1}' | sort | uniq -c | awk '$1>1 {print $2}' > dups.txt
% grep -f dups.txt listing.txt
c193497a1a06b2c72230e6146ff47080 files/2a
c193497a1a06b2c72230e6146ff47080 files/2b
Find and print all files null terminated (-print0).
Use xargs to md5sum them.
Save a copy of the sums and filenames in "listing.txt" file.
Grab just the sums and pass them through sort and uniq -c to count how many times each sum occurs.
Use awk to print the sums that occur more than once, saving them in the "dups.txt" file; then grep listing.txt for those sums to recover the filenames.
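If checksumming every one of the 7000+ files up front is still too slow, a common refinement (just a sketch, not part of the recipe above; the sizes.txt scratch file is my own name, and GNU find, xargs, and uniq are assumed) is to group files by size first, since duplicates must be the same size, and only hash the candidates:
find /path/to/root/folder -type f -printf '%s\t%p\n' > sizes.txt
# pass 1 counts each size; pass 2 prints the paths whose size occurs more than once
awk -F'\t' 'NR==FNR { c[$1]++; next } c[$1] > 1 { print $2 }' sizes.txt sizes.txt |
xargs -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate
uniq -w32 compares only the 32-character md5 field, and --all-repeated=separate prints each group of duplicates separated by a blank line.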
I want to delete files in the current folder with the following pattern.
0_something.sql.tar
I have a string provided which contains numbers
number_string="0,1,2,3,4"
How can I delete any files not included in the number_string while also keeping to the x_x.sql.tar pattern?
For example, I have these files:
0_something.sql.tar
2_something.sql.tar
4_something.sql.tar
15_something.sql.tar
Based on this logic, and the numbers in the number string, I should only remove 15_something.sql.tar because:
It follows the x_something.sql.tar pattern.
Its number (15) is not in the number string.
This might help you out:
s="0,1,2,3,4"
s=",${s},"
for f in *.sql.tar; do
n="${f%_*}"
[ "${n//[0-9]}" ] && continue
[ "$s" == "${s/,${n},/}" ] && echo rm -- "$f"
done
Remove the echo if this answer pleases you
What this is doing is the following:
convert your number_string s into a string which is fully comma-separated and
also starts and ends with a comma (s=",0,1,2,3,4,"). This allows us to search for entries like ,5,
loop over all files matched by the glob *.sql.tar
n="${f%_*}": Extract the substring before the first underscore `
[ "{n//[0-9]}" ] && continue: validate if the substring is an integer, if not, skip the file and move to the next one.
substitute the number in the number_string (with commas), if the substring does not change, it implies we should not keep the file
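With the question's four example files in the current directory, a dry run of this loop prints just one command:
rm -- 15_something.sql.tar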
# Get the unmatched numbers from the second stream
# ie. files to be removed
join -v2 -o2.2 <(
# output sorted numbers on separate lines
sort <<<${number_string//,/$'\n'}
) <(
# find all files named in this way
# and print filename, a tab, and the path, one file per line
find . -name '[0-9]*_something.sql.tar' -printf "%f\t%p\n" |
# extract numbers from filenames only
sed 's/\([0-9]*\)[^\t]*/\1/' |
# sort for join
sort
) |
# pass the input to xargs
# remove echo to really remove files
xargs -d '\n' echo rm
Tested on repl
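For the question's four files, the two sorted streams fed to join look like this (illustrative; the second stream has a tab between number and path):
0
1
2
3
4
and
0	./0_something.sql.tar
15	./15_something.sql.tar
2	./2_something.sql.tar
4	./4_something.sql.tar
Note that 15 sorts between 1 and 2 lexically, which is why both sides use plain sort. join pairs lines on the first field, and -v2 -o2.2 prints field 2 of only the unpaired second-stream lines, i.e. ./15_something.sql.tar.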
$IFS can help here.
( IFS=,; for n in $number_string; do echo rm $n\_something.sql.tar; done; )
The parens run the command in a subshell so the reassignment of IFS is scoped.
Setting it to a comma lets the command parser split the string into discrete numbers for you and loop over them.
If that gives you the right list of commands you want to execute, just take out the echo. :)
UPDATE
OH! I see that now. Sorry, my bad, lol...
Well then, let's try a totally different approach. :)
Extended Globbing is likely what you need.
shopt -s extglob # turn extended globbing on
echo rm !(${number_string//,/\|})_something.sql.tar
That'll show you the command that would be executed. If you're satisfied, take the echo off. :)
This skips the need for a brute-force loop.
Explanation -
Once extglob is on, !(...) means "anything that does NOT match any of these patterns."
${number_string//,/\|} replaces all commas in the string with pipe separators, creating a match pattern for the extended glob.
Thus, !(${number_string//,/\|}) means anything NOT matching one of those patterns; !(${number_string//,/\|})_something.sql.tar then means "anything that starts with something NOT one of these patterns, followed by this string."
I created these:
$: printf "%s\n" *_something.sql.tar
0_something.sql.tar
1_something.sql.tar
2_something.sql.tar
3_something.sql.tar
4_something.sql.tar
5_something.sql.tar
6_something.sql.tar
7_something.sql.tar
8_something.sql.tar
9_something.sql.tar
then after setting extglob and using the above value for $number_string, I get this:
$: echo !(${number_string//,/\|})_something.sql.tar
5_something.sql.tar 6_something.sql.tar 7_something.sql.tar 8_something.sql.tar 9_something.sql.tar
Be careful about quoting, though. You can quote it to see the pattern itself, but then it matches nothing.
$: echo "!(${number_string//,/\|})_something.sql.tar"
!(0|1|2|3|4)_something.sql.tar
if you prefer the loop...
for f in *_something.sql.tar                    # iterating over all these
do case ",$number_string," in                   # the number list, padded with commas
   *",${f%_something.sql.tar},"*) continue ;;   # this file's number is in the list - keep it
   *) rm "$f" ;;                                # delete nonmatches
   esac
done
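With the question's files this keeps 0, 2 and 4 and deletes 15_something.sql.tar; to preview first, replace rm "$f" with echo rm "$f".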
Write a script to do the matching, and remove those names that do not match. For example:
$ rm -rf foo
$ mkdir foo
$ cd foo
$ touch {2,4,6,8}.tar
$ echo "$number_string" | tr , \\n | sed 's/$/.tar/' > match-list
$ find . -type f -exec sh -c 'echo "$1" | grep -f match-list -v -q' _ {} \; -print
./6.tar
./8.tar
./match-list
Replace -print with -delete to actually unlink the names. Note that this will cause problems since match-list will probably get deleted midway through and no longer exist for future matches, so you'll want to modify it a bit. Perhaps:
find . -type f -not -name match-list -name '*.tar' -exec sh -c 'echo "$1" | grep -f match-list -v -q' _ {} \; -delete
In this case, there's no need to explicitly exclude 'match-list' since it will not match the -name '*.tar' primitive, but is included here for completeness.
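One caveat not covered above: grep -f treats each line of match-list as an unanchored basic regular expression, so a pattern like 1.tar would also keep ./11.tar or ./21.tar. Anchoring the generated patterns avoids the false matches, for example:
echo "$number_string" | tr , \\n | sed 's|^|/|; s/$/\\.tar$/' > match-list
which turns each number into a pattern like /0\.tar$ that can only match the exact basename.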
I have borrowed from some previous answers, but credit is given and the resulting script is nice:
$ ls -l
total 4
-rwxr-xr-x 1 boffi boffi 355 Jul 27 10:58 rm_tars_except
$ cat rm_tars_except
#!/usr/bin/env bash
dont_rm="$1"
# https://stackoverflow.com/a/10586169/2749397
IFS=',' read -r -a dont_rm_a <<< "$dont_rm"
for tarfile in ?.tar ; do
digit=$( basename "$tarfile" .tar )
# https://stackoverflow.com/a/15394738/2749397
[[ " ${dont_rm_a[#]} " =~ " ${digit} " ]] && \
echo "# Keep $tarfile" || \
echo "rm $tarfile"
done
$ touch 1.tar 3.tar 5.tar 7.tar
$ ./rm_tars_except 3,5
rm 1.tar
# Keep 3.tar
# Keep 5.tar
rm 7.tar
$ ./rm_tars_except 3,5 | sh
$ ls -l
total 4
-rw-r--r-- 1 boffi boffi 0 Jul 27 11:00 3.tar
-rw-r--r-- 1 boffi boffi 0 Jul 27 11:00 5.tar
-rwxr-xr-x 1 boffi boffi 355 Jul 27 10:58 rm_tars_except
$
If we can drop the requirement that the "keep" info be presented as a comma-separated string, then the script can be significantly simplified:
#!/usr/bin/env bash
for tarfile in ?.tar ; do
digit=$( basename "$tarfile" .tar )
# https://stackoverflow.com/a/15394738/2749397
[[ " ${#} " =~ " ${digit} " ]] && \
echo "# Keep $tarfile" || \
echo "rm $tarfile"
done
That, of course, should be called like this: ./rm_tars_except 3 5 | sh
find . -type f -name '*_something.sql.tar' | grep -vE "<alternation of the numbers to keep>" | xargs rm -f
example:
find . -type f -name '*_something.sql.tar' | grep -vE '/(0|1|2|3|4)_' | xargs rm -f
grep -v keeps only the files whose number is not in the list, and the /(...)_  anchoring stops 15 from matching the bare 1.
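If the filenames might contain spaces, a null-delimited variant is safer; this assumes GNU grep (-z) and xargs (-0):
find . -type f -name '*_something.sql.tar' -print0 | grep -zvE '/(0|1|2|3|4)_' | xargs -0 rm -f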
I am searching for the longest filename from my root directory to the very bottom.
I have coded a C program that will calculate the longest file name's length and its name.
However, I cannot get the shell to redirect the long list of file names to standard input for my program to receive it.
Here is what I did:
ls -Rp | grep -v / | grep -v "Permission denied" | ./home/user/findlongest
findlongest has been compiled and I checked it in one of my IDEs to make sure it's working correctly. No runtime errors were detected so far.
How do I get the list of file names into my 'findlongest' code by redirecting stdin?
Try this:
find / -type f -printf '%f\n' 2>/dev/null | /home/user/findlongest
The 2>/dev/null will discard all data written to stderr (which is where you're seeing the 'Permission denied' messages from).
Or the following to remove the dependency on your application:
find / -type f -printf '%f\n' 2>/dev/null | \
awk 'length > max_length {
max_length = length; longest_line = $0
}
END {
print length(longest_line) " " longest_line
}'
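An alternative to tracking the maximum in awk, if you'd also like to see the runners-up, is to prefix every name with its length and sort (same GNU find invocation as above; just a sketch, and it does more work since it sorts everything):
find / -type f -printf '%f\n' 2>/dev/null | awk '{ print length, $0 }' | sort -n | tail -1
Change tail -1 to tail -5 to show the five longest names.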
What about
find / -type f | /home/user/findlongest
It will list all files from root with their absolute paths, printing only those files you have permission to list (error messages for the rest go to stderr).
Based on the command:
find -exec basename '{}' ';'
which recursively prints only the filenames of all the files starting from the directory you are in: all the filenames.
This bash line will provide the file with the longest name and its number of characters:
Note that the loop involved will make the process slow.
for i in $(find -exec basename '{}' ';'); do printf '%s ' "$i" && echo -n "$i" | wc -c; done | sort -nk 2 | tail -1
By parts:
Prints the name of the file followed by a single space:
printf $i" "
Prints the number of characters of such file:
echo -n "$i" | wc -c
Sorts the output by the number of characters and takes the longest one (the very last):
sort -nk 2 | tail -1
All this inside a for loop to handle line by line.
The for statement can also be changed to:
for i in $(find -type f -printf '%f\n');
As stated in Attie's answer.
How can I find files that have long chains of consecutive 0s (zero bytes - 0x00) as a result of disk failure? For example, how can I find files that have more than 10000 zero bytes in sequence?
Sure, I can write a program using Java or other programming language, but is there a way to do it using more or less standard Linux command line tools?
Update
You can generate a test file with dd if=/dev/zero of=zeros bs=1 count=100000.
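(dd with bs=1 issues 100000 one-byte writes; the equivalent dd if=/dev/zero of=zeros bs=100000 count=1 creates the same file in a single write and is much faster.)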
This may be a start:
find /some/starting/point -type f -size +9999c -exec \
perl -nE 'if (/\x0{10000}/) {say $ARGV; close ARGV}' '{}' +
(-size +9999c restricts the search to files big enough to contain the run; a bare +10000 would mean 512-byte blocks, not bytes.)
To test for a single file, named filename:
if tr -sc '\0' '\n' < filename | tr '\0' Z | grep -qE 'Z{1000}'; then
# ...
fi
You can now use a suitable find command to select the files to test; adjust Z{1000} to the run length you need (the question asks for 10000).
For example, all *.txt files in PWD:
while read -rd '' filename;do
if tr -sc '\0' '\n' < "$filename" | tr '\0' Z | grep -qE 'Z{1000}'; then
# For example, simply print "$filename"
printf '%s\n' "$filename"
fi
done < <(find . -type f -name '*.txt' -print0)
Find and grep should work just fine (GNU grep's -P option enables PCRE, whose \x00 escape matches the NUL byte):
grep -P '\x00{1000}' <file name>
if it's a single file or a group of files in the same dir
If you want to search throughout the system there's:
find /dir/ -type f -exec grep -lP '\x00{1000}' {} \; 2> /dev/null
this is very slow though; if you're looking for something faster and can live with matching shorter runs of zeros as well, lower the repeat count, for instance grep -lP '\x00{100}'
I need some help combining elements of scripts to form a read output.
Basically I need to get the user name from the folder structure listed below and then count the number of non-blank lines in that user's folder across files of type *.ano.
This is shown in the extract below; note that the position of the username in the path is not always the same counting from the front.
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.txt
/home/user/Drive-backup/2011 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/3.ano
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.ano
awk -F/ '{print $(NF-2)}'
This will give me the username I need, but I also need to know how many non-blank lines there are in that user's folder for file type *.ano. I have the grep below that works, but I don't know how to put it all together so it can output a file that makes sense.
grep -cv '^[[:space:]]*$' *.ano | awk -F: '{ s+=$2 } END { print s }'
Example output needed
UserA 500
UserB 2
UserC 20
find /home -name '*.ano' | awk -F/ '{print $(NF-2)}' | sort | uniq -c
That ought to give you the number of "*.ano" files per user, given that your awk is correct. I often use sort | uniq -c to count the number of instances of a string (in this case the username), as opposed to wc -l, which only counts input lines.
Enjoy.
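If you need non-blank line counts per user rather than file counts, the same path-splitting idea combines with the question's grep; a sketch assuming GNU grep (-H forces the filename prefix even when a batch contains a single file) and paths without colons:
find /home -name '*.ano' -exec grep -Hcv '^[[:space:]]*$' {} + |
awk -F: '{ n = split($1, parts, "/"); sum[parts[n-2]] += $2 }
         END { for (u in sum) print u, sum[u] }'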
Have a look at wc (word count).
To count the number of *.ano files in a directory you can use
find "$dir" -iname '*.ano' | wc -l
If you want to do that for all directories in some directory, you can just use a for loop:
for dir in * ; do
echo "user $dir"
find "$dir" -iname '*.ano' | wc -l
done
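To get the requested "UserA 500"-style layout from the same loop (assuming, as above, that each top-level directory is named after a user):
for dir in */ ; do
    printf '%s %s\n' "${dir%/}" "$(find "$dir" -iname '*.ano' | wc -l)"
done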
Execute the bash-script below from folder
/home/user/Drive-backup/2010 Backup/2010 Account/Jan
and it will report the number of non-blank lines per user.
#!/bin/bash
#save where we start
base=$(pwd)
# get all top-level dirs, skip '.'
D=$(find . \( -type d ! -name . -prune \))
for d in $D; do
    cd "$base"
    cd "$d"
    # search for all files named *.ano and count their non-blank lines
    sum=$(find . -type f -name '*.ano' -exec grep -cv '^[[:space:]]*$' {} \; | awk '{sum+=$0}END{print sum}')
    echo "$d" "$sum"
done
This might be what you want (untested): requires bash version 4 for associative arrays
declare -A count
cd /home/user/Drive-backup
for userdir in */*/*/*; do
username=${userdir##*/}
lines=$(grep -cv '^[[:space:]]*$' "$userdir"/user.dir/*.ano | awk -F: '{sum += $2} END {print sum}')
(( count[$username] += lines ))
done
for user in "${!count[@]}"; do
echo "$user" "${count[$user]}"
done
Here's yet another way of doing it (on Mac OS X 10.6):
find -x "$PWD" -type f -iname "*.ano" -exec bash -c '
ar=( "${#%/*}" ) # perform a "dirname" command on every array item
printf "%s\000" "${ar[#]%/*}" # do a second "dirname" and add a null byte to every array item
' arg0 '{}' + | sort -uz |
while IFS="" read -r -d '' userDir; do
# to-do: customize output to get example output needed
echo "$userDir"
basename "$userDir"
find -x "${userDir}" -type f -iname "*.ano" -print0 |
xargs -0 -n 500 grep -hcv '^[[:space:]]*$' | awk '{ s+=$0 } END { print s }'
#xargs -0 -n 500 grep -cv '^[[:space:]]*$' | awk -F: '{ s+=$NF } END { print s }'
printf '%s\n' '----------'
done