bash remove duplicate files based on sequence number at the end - linux

Hi,
I'm trying to delete some duplicate files in a folder (approx. 50,000 files) that have the same name; the only thing that differs is a sequence number at the end:
aaaaaaaaaa.ext.84837384
aaaaaaaaaa.ext.44549388
aaaaaaaaaa.ext.22134455
bbbbbbbbbb.ext.11244355
bbbbbbbbbb.ext.88392456
I want to delete the duplicate files and keep only the one with the minimum sequence number (.22134455 to be kept for aaaaaaaaaa.ext and .11244355 to be kept for bbbbbbbbbb.ext).
As mentioned, I have a lot of files in the folder (~50,000), and sorting and filtering based on size and md5 takes forever.
I tried find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate but it is taking forever.
Thank you very much

Use this
find . -name '*.ext.*' -print0 | sort -z | awk -v RS='\0' -F. '{fn=$0; num=$NF; $NF=""; if(a[$0]){printf "%s\0", fn};a[$0]++;}' | xargs -n 100 -0 rm -f
Explanation:
find . -name '*.ext.*' -print0: Print filenames delimited by a null character.
sort -z: Sort zero delimited entries.
awk: separate records by the null character & fields by a '.'. Strip off the last field (the sequence number) & remember the remaining filename. Except for the first entry in each group, print the other file names, separated by null characters; since the list is sorted, the first (lowest-numbered) entry is the one kept.
xargs -0: receive null char separated filenames on stdin & rm -f them.
Assumption: All the files are in the current directory.
Add the -maxdepth 1 option to the find command if there are sub-directories & you want to skip iterating through them.
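If you want to see what would be removed before actually deleting anything, a dry run of the same pipeline (hedged: it only prints the candidates instead of removing them, assuming the sample files sit in the current directory):
find . -name '*.ext.*' -print0 | sort -z | awk -v RS='\0' -F. '{fn=$0; $NF=""; if(a[$0]){printf "%s\0", fn};a[$0]++;}' | xargs -0 -n1 echo "would remove:"
With the sample names above this should print the two higher-numbered aaaaaaaaaa.ext files and the higher-numbered bbbbbbbbbb.ext file, confirming that the lowest sequence number in each group is the one kept.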

This script will remove all duplicated files in the directory it's in.
It lists and sorts the files by name (the sequence number orders the duplicates), then removes a file if its base name was already 'visited'; otherwise it just saves the filename minus the sequence number in a temporary variable.
#!/bin/bash
tmp_filename=
for full_filename in $(ls | sort); do
    filename=$(basename "$full_filename")
    extension="${filename##*.}"   # the sequence number (last dot-separated field)
    filename="${filename%.*}"     # the name without the sequence number
    if [[ "$tmp_filename" == "$filename" ]]; then
        # same base name as the previous (lower-numbered) file: a duplicate
        rm "$full_filename"
    else
        tmp_filename="$filename"
    fi
done

Related

Bash: Compare alphanumeric string with lower and upper case

Given a directory with files with an alphanumeric name:
file45369985.xml
file45793220.xml
file0005461x.xml
Also, given a csv table with a list of files
file45369985.xml,file,45369985,.xml,https://www.tib.eu/de/suchen/id/FILE:45369985/Understanding-terrorism-challenges-perspectives?cHash=16d713678274dd2aa205fc07b2fc5b86
file0005461X.xml,file,0005461X,.xml,https://www.tib.eu/de/suchen/id/FILE:0005461X/The-reality-of-social-construction?cHash=5d8152fbbfae77357c1ec6f443f8c8a4
I would like to match all files in the csv table with the directory's content and move them somewhere else. However, I cannot switch off the case sensitivity in this command:
while read p; do
data_set=$(echo "$p" | cut -f1 -d",")
# do something else
done
How can the "X-Files" be correctly matched as well?
Given the format of the csv file (no quotes around the first field), here is an answer for filenames without newlines.
List all files in current directory
find . -maxdepth 1 -type f -printf "%f\n"
Look for one filename in that list (ignoring case)
grep -Fix file0005461X.xml <(find . -maxdepth 1 -type f -printf "%f\n")
Show first field only from file
cut -d"," -f1 csvfile
Pretend that the output is a file
<(cut -d"," -f1 csvfile)
Tell grep to read the strings to look for from that "file", using the -f option
grep -Fixf <(cut -d"," -f1 csvfile) <(find . -maxdepth 1 -type f -printf "%f\n")
Move to /tmp
grep -Fixf <(cut -d"," -f1 csvfile) <(find . -maxdepth 1 -type f -printf "%f\n") |
xargs -i{} mv "{}" /tmp
You can use join to perform an inner join between the CSV and the file list:
join -i -t, \
<(sort -t, -k1 list.csv) \
<(find given_dir -maxdepth 1 -mindepth 1 -type f -printf "%f\n" | sort) \
-o "2.1"
Explanation:
-i: perform a case insensitive comparison for the join
-t,: use the comma as a field separator
<(sort -t, -k1 list.csv): sort the CSV file on its first field, using the comma as field separator, and use process substitution to present the output as a file argument (see the Bash manual page)
<(find given_dir -maxdepth 1 -mindepth 1 -type f -printf "%f\n" | sort): list all the files stored at the top level of given_dir (and not in its subdirectories), sort them, and use process substitution as above
-o "2.1": output the first field of the second input (the find output) for each joined line
Note: this solution relies on GNU find because of the -printf option.
awk -F '[,.]' '{
    print substr($1,1,length($1)-1) toupper(substr($1,length($1))) "." $2
    print substr($1,1,length($1)-1) tolower(substr($1,length($1))) "." $2
}' csvfile | while read line
do
    find /path -name "$line" -exec mv '{}' /newpath \;
done
Use awk and set the field delimiter to . and ,. Take each line and generate both an uppercase and a lowercase X version of the file name.
Loop through this output and find the file in the given path. If the file exists, execute the move command to the given destination.
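As a quick check of what the awk step emits, you can feed it a single sample line (a hedged example; the URL column is shortened here since only the first two fields are used):
echo 'file0005461X.xml,file,0005461X,.xml,https://example' | awk -F '[,.]' '{ print substr($1,1,length($1)-1)toupper(substr($1,length($1)))"."$2; print substr($1,1,length($1)-1)tolower(substr($1,length($1)))"."$2 }'
This prints file0005461X.xml followed by file0005461x.xml, so the subsequent find -name "$line" matches the file whichever case it has on disk.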
You can use grep -i to make case insensitive matches:
while read p; do
    data_set=$(echo "$p" | cut -f1 -d",")
    match=$(ls $your_dir | grep -i "^$data_set\$")
    if [ -n "$match" ]; then
        mv "$your_dir/$match" "$another_dir"
    fi
done

Find the longest file name in Linux

I am searching for the longest filename from my root directory to the very bottom.
I have coded a C program that will calculate the longest file name's length and its name.
However, I cannot get the shell to redirect the long list of file names to standard input for my program to receive it.
Here is what I did:
ls -Rp | grep -v / | grep -v "Permission denied" | ./home/user/findlongest
findlongest has been compiled and I checked it in one of my IDEs to make sure it's working correctly. No runtime errors were detected so far.
How do I get the list of file names into my 'findlongest' code by redirecting stdin?
Try this:
find / -type f -printf '%f\n' 2>/dev/null | /home/user/findlongest
The 2>/dev/null will discard all data written to stderr (which is where you're seeing the 'Permission denied' messages from).
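The same idea in isolation (a hedged illustration; /root is simply a directory a normal user usually cannot read):
ls /root /tmp 2>/dev/null | wc -l
Only what is written to stdout travels down the pipe to wc; the "Permission denied" message goes to stderr and is discarded, which is also why your grep -v "Permission denied" never saw it.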
Or the following to remove the dependency on your application:
find / -type f -printf '%f\n' 2>/dev/null | \
awk 'length > max_length {
max_length = length; longest_line = $0
}
END {
print length(longest_line) " " longest_line
}'
What about
find / -type f | /home/user/findlongest
It will list all files from the root with their absolute paths, printing only those files you have permission to list.
Based on the command:
find -exec basename '{}' ';'
which recursively prints only the filenames of all the files, starting from the directory you are in.
This bash line will provide the file with the longest name and its number of characters:
Note that the loop involved will make the process slow.
for i in $(find -exec basename '{}' ';'); do printf $i" " && echo -e -n $i | wc -c; done | sort -nk 2 | tail -1
By parts:
Prints the name of the file followed by a single space:
printf $i" "
Prints the number of characters of such file:
echo -e -n $i | wc -c
Sorts the output by number of characters and takes the longest one (the very last):
sort -nk 2 | tail -1
All this inside a for loop to handle line by line.
The for sentence can be also changed by:
for i in $(find -type f -printf '%f\n');
As stated in @Attie's answer

Sort and rename files using a 9 digit sequence number

I want to rename multiple jpg files in a directory so they have a 9-digit sequence number. I also want the files to be sorted by date from oldest to newest. I came up with this:
ls -tr | nl -v 100000000 | while read n f; do mv "$f" "$n.jpg"; done
This renames the files as I want them, but the sequence numbers do not follow the date. I have also tried doing
ls -tr | cat -n .....
but that does not allow me to specify the starting sequence number.
Any suggestions what's wrong with my syntax?
Any other ways of achieving my goal?
Thanks
If any of your filenames contain whitespace, you can use the following:
i=100000000
find -type f -printf '%T# %p\0' | \
sort -zk1nr | \
sed -z 's/^[^ ]* //' | \
xargs -0 -I % echo % | \
while read f; do
mv "$f" "$(printf "%09d" $i).jpg"
let i++
done
Note that this doesn't use ls for parsing; it uses the null byte as the separator in the different commands (written \0, -z and -0 respectively).
The find command prints the file time together with the name.
Then the files are sorted and sed removes the timestamp. xargs hands the filenames to the mv command through read.
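If you want to verify the ordering and numbering before touching anything, the same pipeline can be run with echo in front of mv (a hedged preview; nothing is renamed):
i=100000000
find -type f -printf '%T# %p\0' | sort -zk1nr | sed -z 's/^[^ ]* //' | xargs -0 -I % echo % | while read f; do echo mv "$f" "$(printf "%09d" $i).jpg"; let i++; done
If the preview shows the numbers running from newest to oldest and you want the opposite, drop the r from sort -zk1nr so the oldest file gets the first number.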
DIR="/tmp/images"
FILELIST=$(ls -tr ${DIR})
n=1
for file in ${FILELIST}; do
    printf -v digit "%09d" $n
    mv "$DIR/${file}" "$DIR/${digit}.jpg"
    n=$((n + 1))
done
Something like this? Then you can use n to specify the starting sequence number. However, if you have spaces in your file names this would not work.
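To start at the 9-digit value used in the question, you could simply initialise the counter differently before the loop (a small, hedged tweak):
n=100000000
printf -v digit "%09d" $n already zero-pads, so with n=1 the names would start at 000000001 instead.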
If using an external tool is acceptable, you can use rnm:
rnm -ns '/i/.jpg' -si 100000000 -s/mt *.jpg
-ns: Name string (new name).
/i/: Index (A name string rule).
-si: Option that sets starting index.
-s/mt: Sort according to modification time.
If you want an arbitrary increment value:
rnm -ns '/i/.jpg' -si 100000000 -inc 45 -s/mt *.jpg
-inc: Specify an increment value.

Finding and counting duplicate filenames

I need to search through all subfolders of the current folder recursively and list all files of a certain type along with the number of duplicates.
e.g. if the current folder is home and there are 2 subfolders dir1 and dir2,
then I need it to search dir1 and dir2 and list the file names and number of duplicates.
This is what I have so far:
I am using
find -name "*.h" .
to get a list of all the files of certain type.
I need to now count duplicates and create a new list like
file1.h 2
file2.h 1
where file1.h is the file name and 2 is the number of occurrences overall.
Use uniq --count
You can use a set of core utilities to do this quickly. For example, given the following setup:
mkdir -p foo/{bar,baz}
touch foo/bar/file{1,2}.h
touch foo/baz/file{2,3}.h
you can then find (and count) the files with a pipeline like this:
find foo -name \*.h -print0 | xargs -0n1 basename | sort | uniq -c
This results in the following output:
1 file1.h
2 file2.h
1 file3.h
If you want other output formats, or to order the list in some other way than alphabetically by file, you can extend the pipeline with another sort (e.g. sort -nr) or reformat your columns with sed, awk, perl, ruby, or your text-munging language of choice.
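For example, to see the most-duplicated names first, you could append that extra sort (a hedged variation of the pipeline above):
find foo -name \*.h -print0 | xargs -0n1 basename | sort | uniq -c | sort -nr
With the sample setup this puts "2 file2.h" on the first line.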
find -name "*.h"|awk -F"/" '{a[$NF]++}END{for(i in a)if(a[i]>1)print i,a[i]}'
Note: This prints only the filenames that occur more than once, along with how many times each appears.
Using a shell script, the following code will print a filename if there are duplicates, then list all the duplicates below it.
The script is used as in the following example:
./find_duplicate.sh ./ Project
and will search the current directory tree for file names with 'project' in them.
#! /bin/sh
find "${1}" -iname "*${2}*" -printf "%f\n" \
    | tr '[A-Z]' '[a-z]' \
    | sort -n \
    | uniq -c \
    | sort -n -r \
    | while read LINE
do
    COUNT=$( echo ${LINE} | awk '{print $1}' )
    [ ${COUNT} -eq 1 ] && break
    FILE=$( echo ${LINE} | cut -d ' ' -f 2-10000 2> /dev/null )
    echo "count: ${COUNT} | file: ${FILE}"
    # escape any [ ] in the name so the next find -iname treats them literally
    FILE=$( echo ${FILE} | sed -e s/'\['/'\\\['/g -e s/'\]'/'\\\]'/g )
    find ${1} -iname "${FILE}" -exec echo " {}" ';'
    echo
done
If you wish to search for all files (and not for a pattern in the name), replace the line:
find "${1}" -iname "*${2}*" -printf "%f\n" \
with
find "${1}" -type f -printf "%f\n" \

Combining greps to make script to count files in folder

I need some help combining elements of scripts to produce a readable output.
Basically, from the folder structure listed below, I need to get the username and count the number of lines in that user's folder for files of type *.ano.
This is shown in the extract below; note that the position of the username in the path is not always the same counting from the front.
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.txt
/home/user/Drive-backup/2011 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/3.ano
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.ano
awk -F/ '{print $(NF-2)}'
This will give me the username I need, but I also need to know how many non-blank lines there are in that user's folder for file type *.ano. I have the grep below that works, but I don't know how to put it all together so it can output a file that makes sense.
grep -cv '^[[:space:]]*$' *.ano | awk -F: '{ s+=$2 } END { print s }'
Example output needed
UserA 500
UserB 2
UserC 20
find /home -name '*.ano' | awk -F/ '{print $(NF-2)}' | sort | uniq -c
That ought to give you the number of "*.ano" files per user given your awk is correct. I often use sort/uniq -c to count the number of instances of a string, in this case username, as opposed to 'wc -l' only counting input lines.
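If you want the username first, like the "UserA 500" layout in the question, a hedged tweak is to swap the two columns at the end (note this is still a per-user file count, not a line count):
find /home -name '*.ano' | awk -F/ '{print $(NF-2)}' | sort | uniq -c | awk '{print $2, $1}'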
Enjoy.
Have a look at wc (word count).
To count the number of *.ano files in a directory you can use
find "$dir" -iname '*.ano' | wc -l
If you want to do that for all directories in some directory, you can just use a for loop:
for dir in * ; do
echo "user $dir"
find "$dir" -iname '*.ano' | wc -l
done
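A small variation prints each user and count on one line, closer to the example output in the question (hedged, and still a count of files rather than of non-blank lines):
for dir in * ; do
    echo "$dir $(find "$dir" -iname '*.ano' | wc -l)"
done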
Execute the bash-script below from folder
/home/user/Drive-backup/2010 Backup/2010 Account/Jan
and it will report the number of non-blank lines per user.
#!/bin/bash
# save where we start
base=$(pwd)
# get all top-level dirs, skip '.'
D=$(find . \( -type d ! -name . -prune \))
for d in $D; do
    cd $base
    cd $d
    # search for all files named *.ano and count their non-blank lines
    sum=$(find . -type f -name '*.ano' -exec grep -cv '^[[:space:]]*$' {} \; | awk '{sum+=$0}END{print sum}')
    echo $d $sum
done
This might be what you want (untested): requires bash version 4 for associative arrays
declare -A count
cd /home/user/Drive-backup
for userdir in */*/*/*; do
    username=${userdir##*/}
    lines=$(grep -cv '^[[:space:]]*$' "$userdir"/user.dir/*.ano | awk -F: '{sum += $NF} END {print sum}')
    (( count[$username] += lines ))
done
for user in "${!count[@]}"; do
    echo $user ${count[$user]}
done
Here's yet another way of doing it (on Mac OS X 10.6):
find -x "$PWD" -type f -iname "*.ano" -exec bash -c '
    ar=( "${@%/*}" )              # perform a "dirname" command on every array item
    printf "%s\000" "${ar[@]%/*}" # do a second "dirname" and add a null byte to every array item
' arg0 '{}' + | sort -uz |
    while IFS="" read -r -d '' userDir; do
        # to-do: customize output to get example output needed
        echo "$userDir"
        basename "$userDir"
        find -x "${userDir}" -type f -iname "*.ano" -print0 |
            xargs -0 -n 500 grep -hcv '^[[:space:]]*$' | awk '{ s+=$0 } END { print s }'
        #xargs -0 -n 500 grep -cv '^[[:space:]]*$' | awk -F: '{ s+=$NF } END { print s }'
        printf '%s\n' '----------'
    done
