find files with same name in different directories and count duplicates - linux

I hope you can help me with the following problem. I have 24 directories, each containing many (thousands of) files. I would like to find out which combination of directories contains the largest number of duplicate (by name only) files. For example, if we only consider 4 directories
dir1 dir2 dir3 dir4
with the following directory contents
dir1
1.fa 2.fa 3.fa 4.fa 5.fa
dir2
1.fa 10.fa 15.fa
dir3
1.fa 2.fa 3.fa
dir4
1.fa 2.fa 3.fa 5.fa 8.fa 10.fa
Therefore, the combination of directories dir1 and dir4 contains the most duplicate files (4: 1.fa, 2.fa, 3.fa and 5.fa).
The problem becomes quite large with 24 directories, so I was thinking that I might use a brute force approach. Something along the lines of:
1. count all duplicate files that occur in all 24 directories
2. drop a directory and count the number of duplicate files
3. replace the directory and drop another one, then count the number
4. repeat for all directories
5. get the subset of 23 directories with the max number of duplicate files
6. repeat the above 2-5 and keep the 22 directories with the most duplicate files
7. repeat until only 2 directories are left
8. choose the combination of directories with the max number of duplicate files
If anyone has a way of doing this I would be very grateful for some advice. I thought of using fdupes or diff but can't figure out how to parse the output and summarise.

I tagged your question with algorithm as I am unaware of any existing bash / Linux tools that can directly solve this problem. The easiest way would be to construct an algorithm for this in a programming language such as Python, C++, or Java instead of a shell script.
That being said, here's a high-level analysis of your problem: at first glance it looks like a minimum set cover problem, but it actually breaks down into 2 parts:
Part 1 - What is the set of files to cover?
You want to find the combination of directories that cover the most number of duplicate files. But first you need to know what the maximum set of duplicate files are within your 24 directories.
Since the intersection of files between 2 directories is always greater than or equal to their intersection with a 3rd directory, the largest duplicate set must come from a pair: go through all pairs of directories and find the maximum intersection set:
(24 choose 2) = 276 comparisons
You take the largest intersection set found and use that as the set you are actually trying to cover.
Part 2 - The minimum set cover problem
This is a well-studied problem in computer science, so you are better served reading the writings of people much smarter than I am.
The only thing I will note is that it's an NP-complete problem, so it's not trivial.
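For reference, the standard greedy approximation for set cover is short enough to sketch in Python. The universe and directory contents below are illustrative placeholders based on the 4-directory example, not a full solution to the 24-directory case.
# Greedy approximation for set cover: repeatedly pick the directory whose
# file set covers the most not-yet-covered names.
# 'universe' and 'dir_files' are made-up example data.
universe = {"1.fa", "2.fa", "3.fa", "5.fa"}            # duplicate names to cover
dir_files = {
    "dir1": {"1.fa", "2.fa", "3.fa", "4.fa", "5.fa"},
    "dir4": {"1.fa", "2.fa", "3.fa", "5.fa", "8.fa", "10.fa"},
}

def greedy_set_cover(universe, sets):
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the set that covers the most still-uncovered elements
        name, files = max(sets.items(), key=lambda kv: len(kv[1] & uncovered))
        if not files & uncovered:
            break                      # nothing left can be covered
        chosen.append(name)
        uncovered -= files
    return chosen

print(greedy_set_cover(universe, dir_files))
The greedy choice is known to be within a logarithmic factor of the optimal cover, which is usually acceptable in practice.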
This is the best I can do to address the original formulation of your question, but I have a feeling that it's overkill for what you actually need to accomplish. You should consider updating your question with the actual problem that you need to solve.

Count duplicate file names in shell:
#! /bin/sh
# directories to test for
dirs='dir1 dir2 dir3 dir4'
# directory pairs already seen
seen=''
for d1 in $dirs; do
    for d2 in $dirs; do
        if echo $seen | grep -q -e " $d1:$d2;" -e " $d2:$d1;"; then
            : # don't count twice
        elif test $d1 != $d2; then
            # remember pair of directories
            seen="$seen $d1:$d2;"
            # count duplicates
            ndups=`ls $d1 $d2 | sort | uniq -c | awk '$1 > 1' | wc -l`
            echo "$d1:$d2 $ndups"
        fi
    done
# sort decreasing and take the first
done | sort -k 2rn | head -1

./count_dups.sh:
1 files are duplicated comparing dir1 to dir2.
3 files are duplicated comparing dir1 to dir3.
4 files are duplicated comparing dir1 to dir4.
1 files are duplicated comparing dir2 to dir3.
2 files are duplicated comparing dir2 to dir4.
3 files are duplicated comparing dir3 to dir4.
./count_dups.sh | sort -n | tail -1
4 files are duplicated comparing dir1 to dir4.
Using the script count_dups.sh:
#!/bin/bash
# This assumes (among other things) that the dirs don't have spaces in the names
cd testdirs
declare -a DIRS=(`ls`);

function count_dups {
    DUPS=`ls $1 $2 | sort | uniq -d | wc -l`
    echo "$DUPS files are duplicated comparing $1 to $2."
}

LEFT=0
while [ $LEFT -lt ${#DIRS[@]} ] ; do
    RIGHT=$(( $LEFT + 1 ))
    while [ $RIGHT -lt ${#DIRS[@]} ] ; do
        count_dups ${DIRS[$LEFT]} ${DIRS[$RIGHT]}
        RIGHT=$(( $RIGHT + 1 ))
    done
    LEFT=$(( $LEFT + 1 ))
done

Can we create a hash table for all of these 24 directories?
If the filename is just a number, the hash function will be very easy to design.
If we can use a hash table, it will be faster to search for and find duplicates.
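A minimal sketch of that idea in Python, assuming the 24 directories all sit under one parent directory (the parent path below is a placeholder): build a hash table (dict) from file name to the directories containing it, then tally duplicates per directory pair.
import os
from collections import defaultdict
from itertools import combinations

parent = "/path/to/parent"             # placeholder: the directory holding the 24 dirs

# hash table: file name -> directories that contain a file with that name
name_to_dirs = defaultdict(list)
for d in sorted(os.listdir(parent)):
    full = os.path.join(parent, d)
    if os.path.isdir(full):
        for fname in os.listdir(full):
            name_to_dirs[fname].append(d)

# tally shared names for every pair of directories
pair_counts = defaultdict(int)
for dirs in name_to_dirs.values():
    for a, b in combinations(dirs, 2):
        pair_counts[(a, b)] += 1

# assumes at least one name occurs in two directories
best_pair, best_count = max(pair_counts.items(), key=lambda kv: kv[1])
print(best_pair, best_count)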

Just out of curiosity, I've done some simple tests: 24 directories with approximately 3900 files in each (a random number between 0 and 9999). Both bash scripts take around 10 seconds each. Here is a basic Python script doing the same in ~0.2 s:
#!/usr/bin/python
import sys, os

def get_max_duplicates(path):
    items = [(d, set(os.listdir(os.path.join(path, d))))
             for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
    if len(items) < 2:
        # need at least two directories
        return ("", "", 0)
    values = [(items[i][0], items[j][0], len(items[i][1].intersection(items[j][1])))
              for i in range(len(items)) for j in range(i+1, len(items))]
    return max(values, key=lambda a: a[2])

def main():
    path = sys.argv[1] if len(sys.argv) == 2 else os.getcwd()
    r = get_max_duplicates(path)
    print "%s and %s share %d files" % r

if __name__ == '__main__':
    main()
As mentioned by Richard, by using a hash table (or a set in Python) we can speed things up. The intersection of two sets is O(min(len(set_a), len(set_b))) and we have to do N(N-1)/2 = 276 comparisons for N = 24.

Related

Create a file with the sample, gene, and line count - linux

I am trying to create a file called depths that has the name of the sample, the gene, and then the number of times that gene is in the sample. The below code is what I have currently, but the output just has the file names. Ex. file name=ERR034597.MTCYB.sam
I want the file to have ERR034597 MTCYB 327, for example.
for i in genes/${i}.sam
filename=$(basename $i)
n_rows=$(cat $i | wc -l)
echo $filename $n_rows > depths
Here
for i in genes/${i}.sam
you're accessing the variable i before it has been assigned, so this won't work. What you probably want to do is
for i in genes/*.sam; do
    filename=$(basename "$i")
    n_rows=$(wc -l "$i")
    echo "$filename" $n_rows > depths
done
And just another note. It's good practice to avoid unnecessary calls to cat and always quote the variables holding filenames.
If I understand what you are attempting, then you need a few more steps to isolate the first part of the filename (e.g. ERR034597) and the gene (e.g. MTCYB) before writing the information to depths. You also need to consider whether you are replacing the contents of depths on each iteration (e.g. using >) or appending to depths with >>.
Since your tag is [Linux], all we can presume is you have a POSIX shell and not an advanced shell like bash. To remove the .sam extension from filename and then separate into the first part and the gene before obtaining the line count, you can do something similar to the following:
#!/bin/sh

:> depths                             # truncate depths (optional - if required)

for i in genes/*.sam; do              # loop over all .sam files
    filename="$(basename "$i")"       # remove path from name
    filename="${filename%.sam}"       # trim .sam extension from name
    gene="${filename##*.}"            # trim to last '.' save as gene
    filename="${filename%.$gene}"     # remove gene from end of name
    n_rows=$(wc -l < "$i")            # get number of lines in file
    echo "$filename $gene $n_rows" >> depths   # append values to depths
done
Which would result in depths containing lines similar to:
ERR034597 MTCYB 92
(where the test file contained 92 lines)
Look things over and let me know if you have further questions.
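For completeness, a rough Python equivalent of the shell script above, assuming the same genes/*.sam layout and sample.gene.sam naming from the question:
import glob
import os

with open("depths", "w") as out:
    for path in sorted(glob.glob("genes/*.sam")):
        name = os.path.basename(path)[:-len(".sam")]    # e.g. ERR034597.MTCYB
        sample, gene = name.rsplit(".", 1)              # split off the gene
        with open(path) as f:
            n_rows = sum(1 for _ in f)                  # line count
        out.write("%s %s %d\n" % (sample, gene, n_rows))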

Given two directory trees how to find which filenames are the same, considering only filenames satisfying a condition?

This answer tells me how to find the files with the same filename in two directories in bash:
diff -srq dir1/ dir2/ | grep identical
Now I want to consider only files which satisfy a condition. If I use ls E*, I get back the files starting with E. I want to do the same with the above command: give me the filenames which are the same in dir1/ and dir2/, but consider only those starting with E.
I tried the following:
diff -srq dir1/E* dir2/E* | grep identical
but it did not work, I got this output:
diff: extra operand '/home/pal/konkoly/c6/elesbe3/1/EPIC_212291374-c06-k2sc.dat.flag.spline'
diff: Try 'diff --help' for more information.
(/home/pal/konkoly/c6/elesbe3/1/EPIC_212291374-c06-k2sc.dat.flag.spline is a file in the so-called dir1, but EPIC_212291374-c06-k2sc.dat.flag.spline is not in the so-called dir2)
How can I solve this?
I tried doing it in the following way, based on this answer:
DIR1=$(ls dir1)
DIR2=$(ls dir2)
for i in $DIR1; do
    for j in $DIR2; do
        if [[ $i == $j ]]; then
            echo "$i == $j"
        fi
    done
done
It works as above, but if I write DIR1=$(ls path1/E*) and DIR2=$(ls path2/E*), it does not work; I get no output.
This is untested, but I'd try something like:
comm -12 <(cd dir1 && ls E*) <(cd dir2 && ls E*)
Basic idea:
Generate a list of filenames in dir1 that satisfy our condition. This can be done with ls E* because we're only dealing with a flat list of files. For subdirectories and recursion we'd use find instead (e.g. find . -name 'E*' -type f).
Put the filenames in a canonical order (e.g. by sorting them). We don't have to do anything here because E* expands in sorted order anyway. With find we might have to pipe the output into sort first.
Do the same thing to dir2.
Only output lines that are common to both lists, which can be done with comm -12.
comm expects to be passed two filenames on the command line, so we use the <( ... ) bash feature to spawn a subprocess and connect its output to a named pipe; the name of the pipe can then be given to comm.
The accepted answer works fine. Though if someone needs a python implementation, this also works:
import glob

dir1withpath = glob.glob("path/to/dir1/E*")
dir2withpath = glob.glob("path/to/dir2/E*")

dir1 = []
for index, each in enumerate(dir1withpath):
    dir1list = dir1withpath[index].split("/")
    dir1.append(dir1list[-1])          # keep only the file name

dir2 = []
for index, each in enumerate(dir2withpath):
    dir2list = dir2withpath[index].split("/")
    dir2.append(dir2list[-1])          # keep only the file name

for each1 in dir1:
    for each2 in dir2:
        if each1 == each2:
            print(each1 + " is in both directories")

Find files not in numerical list

I have a giant list of files that are all currently numbered in sequential order with different file extensions.
3400.PDF
3401.xls
3402.doc
There are roughly 1400 of these files in a directory. What I would like to know is how to find numbers that do not exist in the sequence.
I've tried to write a bash script for this but my bash-fu is weak.
I can get a list of the files without their extensions by using
FILES=$(ls -1 | sed -e 's/\..*$//')
but a few places I've seen say to not use ls in this manner.
(15 days after asking, I couldn't relocate where I read this, if it existed at all...)
I can also get the first file via ls | head -n 1 but I'm pretty sure I'm making this a whole lot more complicated than I need to.
Sounds like you want to do something like this:
shopt -s nullglob
for i in {1..1400}; do
    files=($i.*)
    (( ${#files[@]} > 0 )) || echo "no files beginning with $i"
done
This uses a glob to make an array of all files 1.*, 2.* etc. It then compares the length of the array to 0. If there are no files matching the pattern, the message is printed.
Enabling nullglob is important as otherwise, when there are no files matching, the array will contain one element: the literal value '1.*'.
Based on a deleted answer that was largely correct:
for i in $(seq 1 1400); do ls $i.* > /dev/null 2>&1 || echo $i; done
ls [0-9]* \
| awk -F. '!seen[$1]++ { ++N }
           END { for (n=1; N; ++n) if (!seen[n]) print n; else --N }'
This will stop once it has filled in the last gap; substitute N > 0 || n < 3000 as the loop condition to go at least that far.
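For comparison, the same gap-finding in Python; the 1..1400 range is taken from the earlier answer and the NUMBER.extension naming from the question:
import glob

# numeric stems actually present in the current directory (e.g. "3400" from "3400.PDF")
present = set()
for path in glob.glob("[0-9]*.*"):
    stem = path.split(".", 1)[0]
    if stem.isdigit():
        present.add(int(stem))

# report numbers missing from the expected sequence
for n in range(1, 1401):
    if n not in present:
        print("no files beginning with %d" % n)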

Linux/Perl Returning list of folders that have not been modified for over x minutes

I have a directory that has multiple folders. I want to get a list of the names of folders that have not been modified in the last 60 minutes.
The folders will have multiple files that will remain old so I can't use -mmin +60
I was thinking I could do something with the inverse though: get a list of files that have been modified in the last 60 minutes (-mmin -60) and then output the inverse of that list.
I'm not sure how to go about doing that, or whether there is a simpler way to do so.
Eventually I will take these list of folders in a perl script and will add them to a list or something.
This is what I have so far to get the list of folders
find /path/to/file -mmin -60 | sed 's/\/path\/to\/file\///' | cut -d "/" -f1 | uniq
Above will give me just the names of the folders that have been updated.
There is a neat trick to do set operations on text lines like this with sort and uniq. You already have the paths that have been updated. Assume they are in a file called upd. A simple find -type d can give you all folders; let's assume we have them in a file called all. Then run
cat all upd | sort | uniq -c | grep '^ *1 '
All paths that appear in both files will be prefixed with a count of 2, while paths appearing only in file all will be prefixed with a 1 (uniq -c pads the count with leading blanks, hence the '^ *1 ' pattern). The lines prefixed with 1 represent the set difference between all and upd, i.e. the paths that were not touched. (I take it you are able to remove the count prefix yourself.)
Surely this can be done with perl or any other scripting language, but this simple sort | uniq is just too nice. :-)
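The same set difference is only a few lines of Python, in case the result is going to feed a Perl or Python script anyway; the file names all and upd follow the example above:
# print paths that are in 'all' but not in 'upd', i.e. untouched folders
with open("all") as f:
    all_paths = set(line.strip() for line in f)
with open("upd") as f:
    updated = set(line.strip() for line in f)

for path in sorted(all_paths - updated):
    print(path)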
The diff command is made for this.
Given two files, "all":
# cat all
/dir1/old2
/dir2/old4
/dir2/old5
/dir1/new1
/dir2/old2
/dir2/old3
/dir1/old1
/dir1/old3
/dir1/new4
/dir2/new1
/dir2/old1
/dir1/new2
/dir2/new2
and "updated":
# cat updated
/dir2/new1
/dir1/new4
/dir2/new2
/dir1/new2
/dir1/new1
We can sort the files and run diff. For this task, I prefer inline sorting:
# diff <(sort all) <(sort updated)
4,6d3
< /dir1/old1
< /dir1/old2
< /dir1/old3
9,13d5
< /dir2/old1
< /dir2/old2
< /dir2/old3
< /dir2/old4
< /dir2/old5
If there are any files in "updated" that aren't in "all", they'll be prefixed with '>'.

How to find files with similar filename and how many of them there are with awk

I was tasked to delete old backup files from our Linux database (all except for the newest 3). Since we have multiple kinds of backups, I have to leave at least 3 backup files for each backup type.
My script should group all files with similar (matched) names together and delete all except for the last 3 files (I assume that the OS will sort those files for me, so the newest backups will also be the last ones).
The files are in the format project_name.000000-000000.svndmp.bz2 where 0 can be any arbitrary digit and project_name can be any arbitrary name. The first 6 digits are part of the name, while the last 6 digits describe the backup's version.
So far, my code looks like this:
for i in *.svndmp.bz2 # only check backup files
do
    nOfOccurences = # Need to find out, how many files have the same name
    currentFile = 0
    for f in awk -F"[.-]" '{print $1,$2}' $i # This doesn't work
    do
        if [nOfOccurences - $currentFile -gt 3]
        then
            break
        else
            rm $f
            currentFile++
        fi
    done
done
I'm aware, that my script may try to remove old versions of a backup 4 times before moving on to the next backup. I'm not looking for performance or efficiency (we don't have that many backups).
My code is a result of 4 hours of searching the net and I'm running out of good Google queries (and my boss is starting to wonder why I'm still not back to my usual tasks)
Can anybody give me input as to how I can solve these two problems?
Find nOfOccurences
Make awk find files that fit the pattern "$1.$2-*"
Try this one, and see if it does what you want.
for project in `ls -1 | awk -F'-' '{ print $1 }' | uniq`; do
    files=`ls -1 ${project}* | sort`
    n_occur=`echo "$files" | wc -l`
    for f in $files; do
        if ((n_occur <= 3)); then
            break
        fi
        echo "rm" $f
        ((--n_occur))
    done
done
If the output seems to be OK, just replace the echo line.
Ah, and don't beat me if anything goes wrong. Use at your own risk only.
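A rough Python sketch of the same grouping logic, in case it is easier to review before letting anything delete files; it only prints the rm candidates, uses the *.svndmp.bz2 pattern from the question, and assumes project_name itself contains no '-':
import glob
import os
from collections import defaultdict

# group backups by everything before the first '-', i.e. project_name.000000
groups = defaultdict(list)
for path in glob.glob("*.svndmp.bz2"):
    prefix = os.path.basename(path).split("-", 1)[0]
    groups[prefix].append(path)

for prefix, files in groups.items():
    files.sort()                        # version digits sort lexicographically
    for old in files[:-3]:              # everything except the newest 3
        print("rm", old)                # swap in os.remove(old) once verified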
