Finding and counting duplicate filenames

Finding and counting duplicate filenames - linux

Need to search through all sub folders of current folder recursively and list all files of certain type and number of duplicates
e.g. if current folder is home and there are 2 sub folders dir1 and dir2
Then i need it to search dir1 and dir2 and list file names and number of duplicates
this is what i have so far:
I am using
find -name "*.h" .
to get a list of all the files of certain type.
I need to now count duplicates and create a new list like
file1.h 2
file2.h 1
where file1 is file name and 2 is number of duplicates overall.

Use uniq --count
You can use a set of core utilities to do this quickly. For example, given the following setup:
mkdir -p foo/{bar,baz}
touch foo/bar/file{1,2}.h
touch foo/baz/file{2,3}.h
you can then find (and count) the files with a pipeline like this:
find foo -name \*.h -print0 | xargs -0n1 basename | sort | uniq -c
This results in the following output:
1 file1.h
2 file2.h
1 file3.h
If you want other output formats, or to order the list in some other way than alphabetically by file, you can extend the pipeline with another sort (e.g. sort -nr) or reformat your columns with sed, awk, perl, ruby, or your text-munging language of choice.

find -name "*.h"|awk -F"/" '{a[$NF]++}END{for(i in a)if(a[i]>1)print i,a[i]}'
Note: This will print files with similar names and only if there are more than one.

Using a shell script, the following code will print a filename of there are duplicates, then below that list all duplicates.
The script is used as in the following exanmple:
./find_duplicate.sh ./ Project
and will search the current directory tree for file names with 'project' in it.
#! /bin/sh
find "${1}" -iname *"${2}"* -printf "%f\n" \
| tr '[A-Z]' '[a-z]' \
| sort -n \
| uniq -c \
| sort -n -r \
| while read LINE
do
COUNT=$( echo ${LINE} | awk '{print $1}' )
[ ${COUNT} -eq 1 ] && break
FILE=$( echo ${LINE} | cut -d ' ' -f 2-10000 2> /dev/null )
echo "count: ${COUNT} | file: ${FILE}"
FILE=$( echo ${FILE} | sed -e s/'\['/'\\\['/g -e s/'\]'/'\\\]'/g )
find ${1} -iname "${FILE}" -exec echo " {}" ';'
echo
done
if you wish to search for all files (and not search for a pattern in the name, replace the line:
find "${1}" -iname *"${2}"* -printf "%f\n" \
with
find "${1}" -type f -printf "%f\n" \

Related

Linux: what's a fast way to find all duplicate files in a directory?

I have a directory with many subdirs and about 7000+ files in total. What I need to find all duplicates of all files. For any given file, its duplicates might be scattered around various subdirs and may or may not have the same file name. A duplicate is a file that you get a 0 return code from the diff command.
The simplest thing to do is to run a double loop over all the files in the directory tree. But that's 7000^2 sequential diffs and not very efficient:
for f in `find /path/to/root/folder -type f`
do
for g in `find /path/to/root/folder -type f`
do
if [ "$f" = "$g" ]
then
continue
fi
diff "$f" "$g" > /dev/null
if [ $? -eq 0 ]
then
echo "$f" MATCHES "$g"
fi
done
done
Is there a more efficient way to do it?

On Debian 11:
% mkdir files; (cd files; echo "one" > 1; echo "two" > 2a; cp 2a 2b)
% find files/ -type f -print0 | xargs -0 md5sum | tee listing.txt | \
awk '{print $1}' | sort | uniq -c | awk '$1>1 {print $2}' > dups.txt
% grep -f dups.txt listing.txt
c193497a1a06b2c72230e6146ff47080 files/2a
c193497a1a06b2c72230e6146ff47080 files/2b
Find and print all files null terminated (-print0).
Use xargs to md5sum them.
Save a copy of the sums and filenames in "listing.txt" file.
Grab the sum and pass to sort then uniq -c to count, saving into the "dups.txt" file.
Use awk to list duplicates, then grep to find the sum and filename.

Bash: Compare alphanumeric string with lower and upper case

Given a directory with files with an alphanumeric name:
file45369985.xml
file45793220.xml
file0005461x.xml
Also, given a csv table with a list of files
file45369985.xml,file,45369985,.xml,https://www.tib.eu/de/suchen/id/FILE:45369985/Understanding-terrorism-challenges-perspectives?cHash=16d713678274dd2aa205fc07b2fc5b86
file0005461X.xml,file,0005461X,.xml,https://www.tib.eu/de/suchen/id/FILE:0005461X/The-reality-of-social-construction?cHash=5d8152fbbfae77357c1ec6f443f8c8a4
I would like to match all files in the csv table with the directory's content and move them somewhere else. However, I cannot switch off the case sensitivity in this command:
while read p; do
data_set=$(echo "$p" | cut -f1 -d",")
# do something else
done
How can the "X-Files" be correctly matched as well?

Given the format of the csv file (no quotes around the first field), I show an answer for filenames without newlines.
List all files in current directory
find . -maxdepth 1 -type f -printf "%f\n"
Look for one filename in that list (ignoring case)
grep -Fix file0005461X.xml <(find . -maxdepth 1 -type f -printf "%f\n")
Show first field only from file
cut -d"," -f1 csvfile
Pretend that the output is a file
<(cut -d"," -f1 csvfile)
Tell grep to use that "file" for strings to look for with option f
grep -Fixf <(cut -d"," -f1 csvfile) <(find . -maxdepth 1 -type f -printf "%f\n")
Move to /tmp
grep -Fixf <(cut -d"," -f1 csvfile) <(find . -maxdepth 1 -type f -printf "%f\n") |
xargs -i{} mv "{}" /tmp

You can use join to perform a inner join between the CSV and the file list:
join -i -t, \
<(sort -t, -k1 list.csv) \
<(find given_dir -maxdepth 1 -mindepth 1 -type f -printf "%f\n" | sort) \
-o "2.1"
Explanation:
-i: perform a case insensitive comparison for the join
-t,: use the comma as a field separator
<(sort -t, -k1 list.csv): sort the CSV file on the first field using the comma as a field separator and use the output as a file, and perform a process substitution to "connect the output" to a file and use it as file argument (see Bash manual page)
<(find given_dir -maxdepth 1 -mindepth 1 -type f -printf "%f\n" | sort): list all the file stored in the root of the given directory given_dir (and not in the subdirectories), sort it and perform a process substitution like the above
-o "2.1": list the first column of the second input (the find output) of the join result
Note: this solution relies on GNU find due to printf command

awk -F [,\.] '{ print substr($1,1,length($1)-1)toupper(substr($1,length($1)))"."$2;print substr($1,1,length($1)-1)tolower(substr($1,length($1)))"."$2 }' csvfile | while read line
do
find /path -name "$line" -exec mv '{}' /newpath \;
done
Use awk and set the file delimiter to . and , Take each line and generate both an uppercase and lowercase X version of the file name.
Loop through this output and find the file in a given path. If the file exists, execute the move command to a given path.

You can use grep -i to make case insensitive matches:
while read p; do
data_set=$(echo "$p" | cut -f1 -d",")
match=$(ls $your_dir | grep -i "^$data_set\$")
if [ ! -z match ]; then
mv "$match" $another_dir
fi
done

Bash Script to find the Duplicate filenames in the same directory and send a Notification email

My aim is to find for any duplicates file names by comparing all the file names(abc.xyz , def.csv) in the same Directory. if there aren't any duplicate file names then move all those files(.csv , .xlsx) in the mentioned file path into Archive path.
If there are duplicate filenames, then fetch the names of those duplicate filenames only with their modified date timestamp and send a notification email to the team and move the remaining non-duplicate filenames to the archive folder.
As you can see I am trying to achieve it by the following code.
if the find command is empty, then perform the if condition and perform 'mv' command and exit the script entirely, if they are duplicate files, then exit the if condition and pipe the duplicate files and perform the mail and date stamp operation.
However the code what actually doing is, sending a notification email if it finds or does not find any duplicate files.
if there are duplicate files, then send an email with duplicate filenames and modification name , if there is no duplicate filnames, then it is sending the filename as blank and current time as modified time.
currently there are no files outside archive(only files inside archive, but all the files inside the archive are unique and looks good) so technically it shouldn't send any notification email.
{
DATE=`date +"%Y-%m-%d"`
dirname=/marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation
tempfile=myTempfileName
find $dirname -type f > $tempfile
cat $tempfile | sed 's_.*/__' | sort | uniq -d|
while read fileName
do
grep "$fileName" $tempfile
done
}
if ["$fileName" == ""]; then
mv /marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/*.xlsx /marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/Archive
mv /marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/*.csv /marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/Archive
exit 1
fi | tee '/marketsource/scripts/tj_var.txt' | awk -F"/" '{print $NF}' | tee '/marketsource/scripts/tj_var.txt' | sort -u | tee '/marketsource/scripts/tj_mail.txt'
DATE=`date +"%Y-%m-%d"`
printf "%s\n" "$(</marketsource/scripts/tj_mail.txt)" | while IFS= read -r filename; do
mtime=$(stat -c %y "/marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/$filename")
printf 'Duplicate Filename - %s Uploaded time - %s\n\n' "$filename" "$mtime"
done | mail -s "Duplicate file found ${DATE}" ti#gmail.com

find for any duplicates file names by comparing all the file names(abc.xyz , def.csv) in the same Directory.
its .xlsx and .csv extensions
I'm assuming there are no whitespaces in filenames
IFS=$'\n'
duplicates=($(
find . -maxdepth 1 -type f '(' -name '*.xlsx' -o -name '*.csv' ')' \
-exec bash -c 'printf "%s %s\n" "$1" "${1%.*}"' -- {} \; |
sort -k1 |
uniq -f1 -d |
cut -d' ' -f2
))
# or simpler:
duplicates=($(
find . -type f '(' -name '*.xlsx' -o -name '*.csv' ')' |
sed 's/\.[^\.]*$//' |
sort |
uniq -d
))
IFS=$' \t\n'
# if there aren't any duplicate file names then move all those files(.csv , .xlsx) in the mentioned file path into Archive path
if ((${#duplicates[#]} == 0)); then
find . -type f '(' -name '*.xlsx' -o -name '*.csv' ')' \
-exec mv -v {} "$the_archive_path" \;
# If there are duplicate filenames, then
else
# fetch the names of those duplicate filenames only with their modified date timestamp
duplicate_filenames_with_modified_date=$(
{
printf "%s.xlsx\n" "${duplicates[#]}"
printf "%s.csv\n" "${duplicates[#]}"
} |
xargs -d$'\n' stat -c '%n %y\n'
)
# and send a notification email to the team and
mail the_team <<<"a notification email"
# move the remaining non-duplicate filenames to the archive folder.
find . -maxdepth 1 -type f '(' -name '*.xlsx' -o -name '*.csv' ')' \
-exec bash -c 'echo "$1" "${1%.*}"' -- {} \; | tee /dev/stderr |
sort -k2 |
uniq -f1 -u |
cut -d' ' -f1 |
xargs -r -d$'\n' -I{} echo mv -v {} "$the_archive_folder"
fi

Combining greps to make script to count files in folder

I need some help combining elements of scripts to form a read output.
Basically I need to get the file name of a user for the folder structure listed below and using count the number of lines in the folder for that user with the file type *.ano
This is shown in the extract below, to note that the location on the filename is not always the same counting from the front.
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.txt
/home/user/Drive-backup/2011 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/3.ano
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.ano
awk -F/ '{print $(NF-2)}'
This will give me the username I need but I also need to know how many non blank lines they are in that users folder for file type *.ano. I have the grep below that works but I dont know how to put it all together so it can output a file that makes sense.
grep -cv '^[[:space:]]*$' *.ano | awk -F: '{ s+=$2 } END { print s }'
Example output needed
UserA 500
UserB 2
UserC 20

find /home -name '*.ano' | awk -F/ '{print $(NF-2)}' | sort | uniq -c
That ought to give you the number of "*.ano" files per user given your awk is correct. I often use sort/uniq -c to count the number of instances of a string, in this case username, as opposed to 'wc -l' only counting input lines.
Enjoy.

Have a look at wc (word count).

To count the number of *.ano files in a directory you can use
find "$dir" -iname '*.ano' | wc -l
If you want to do that for all directories in some directory, you can just use a for loop:
for dir in * ; do
echo "user $dir"
find "$dir" -iname '*.ano' | wc -l
done

Execute the bash-script below from folder
/home/user/Drive-backup/2010 Backup/2010 Account/Jan
and it will report the number of non-blank lines per user.
#!/bin/bash
#save where we start
base=$(pwd)
# get all top-level dirs, skip '.'
D=$(find . \( -type d ! -name . -prune \))
for d in $D; do
cd $base
cd $d
# search for all files named *.ano and count blank lines
sum=$(find . -type f -name *.ano -exec grep -cv '^[[:space:]]*$' {} \; | awk '{sum+=$0}END{print sum}')
echo $d $sum
done

This might be what you want (untested): requires bash version 4 for associative arrays
declare -A count
cd /home/user/Drive-backup
for userdir in */*/*/*; do
username=${userdir##*/}
lines=$(grep -cv '^[[:space:]]$' $userdir/user.dir/*.ano | awk '{sum += $2} END {print sum}')
(( count[$username] += lines ))
done
for user in "${!count[#]}"; do
echo $user ${count[$user]}
done

Here's yet another way of doing it (on Mac OS X 10.6):
find -x "$PWD" -type f -iname "*.ano" -exec bash -c '
ar=( "${#%/*}" ) # perform a "dirname" command on every array item
printf "%s\000" "${ar[#]%/*}" # do a second "dirname" and add a null byte to every array item
' arg0 '{}' + | sort -uz |
while IFS="" read -r -d '' userDir; do
# to-do: customize output to get example output needed
echo "$userDir"
basename "$userDir"
find -x "${userDir}" -type f -iname "*.ano" -print0 |
xargs -0 -n 500 grep -hcv '^[[:space:]]*$' | awk '{ s+=$0 } END { print s }'
#xargs -0 -n 500 grep -cv '^[[:space:]]*$' | awk -F: '{ s+=$NF } END { print s }'
printf '%s\n' '----------'
done

Finding the number of files in a directory for all directories in pwd

I am trying to list all directories and place its number of files next to it.
I can find the total number of files ls -lR | grep .*.mp3 | wc -l. But how can I get an output like this:
dir1 34
dir2 15
dir3 2
...
I don't mind writing to a text file or CSV to get this information if its not possible to get it on screen.
Thank you all for any help on this.

This seems to work assuming you are in a directory where some subdirectories may contain mp3 files. It omits the top level directory. It will list the directories in order by largest number of contained mp3 files.
find . -mindepth 2 -name \*.mp3 -print0| xargs -0 -n 1 dirname | sort | uniq -c | sort -r | awk '{print $2 "," $1}'
I updated this with print0 to handle filenames with spaces and other tricky characters and to print output suitable for CSV.

find . -type f -iname '*.mp3' -printf "%h\n" | uniq -c
Or, if order (dir-> count instead of count-> dir) is really important to you:
find . -type f -iname '*.mp3' -printf "%h\n" | uniq -c | awk '{print $2" "$1}'

There's probably much better ways, but this seems to work.
Put this in a shell script:
#!/bin/sh
for f in *
do
if [ -d "$f" ]
then
cd "$f"
c=`ls -l *.mp3 2>/dev/null | wc -l`
if test $c -gt 0
then
echo "$f $c"
fi
cd ..
fi
done

With Perl:
perl -MFile::Find -le'
find {
wanted => sub {
return unless /\.mp3$/i;
++$_{$File::Find::dir};
}
}, ".";
print "$_,$_{$_}" for
sort {
$_{$b} <=> $_{$a}
} keys %_;
'

Here's yet another way to even handle file names containing unusual (but legal) characters, such as newlines, ...:
# count .mp3 files (using GNU find)
find . -xdev -type f -iname "*.mp3" -print0 | tr -dc '\0' | wc -c
# list directories with number of .mp3 files
find "$(pwd -P)" -xdev -depth -type d -exec bash -c '
for ((i=1; i<=$#; i++ )); do
d="${#:i:1}"
mp3s="$(find "${d}" -xdev -type f -iname "*.mp3" -print0 | tr -dc "${0}" | wc -c )"
[[ $mp3s -gt 0 ]] && printf "%s\n" "${d}, ${mp3s// /}"
done
' "'\\0'" '{}' +

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Finding and counting duplicate filenames - linux

find -name "*.h"|awk -F"/" '{a[$NF]++}END{for(i in a)if(a[i]>1)print i,a[i]}' Note: This will print files with similar names and only if there are more than one.

Related

Linux: what's a fast way to find all duplicate files in a directory?

Bash: Compare alphanumeric string with lower and upper case

Bash Script to find the Duplicate filenames in the same directory and send a Notification email

Combining greps to make script to count files in folder

Finding the number of files in a directory for all directories in pwd

Categories

Resources