I want to compare a number of files and find out which of them are the same, but they are not necessarily text files (so please don't suggest diff).
The files can be in any format (ie binary files).
I found out that I can run md5sum to find the hash of each file and then compare the hashes manually to check whether the files are the same. But how can I automate this process?
PS: I also found that I can store the md5sums in a file using
md5sum <file-names> | cat >md5sum.txt
but I am stuck on how to automate this process.
I would prefer this to be done via a script (language no-bar).
If you can use languages like perl or python with builtin support for hashes/dictionaries, it's really easy.
Loop over file names and signature and create a hash with md5sum as key and list of files with that md5 as value.
Then loop over content of hash and show entries with more than one item. These are files likely to be identical (you can't be really sure with a signature based approach).
As people are asking for code, maybe something like below. That is a perl implementation. I may add an equivalent python sample later if it is wanted.
#!perl
# Map each MD5 signature to the list of files that have it
my %same;
for my $x (@ARGV){
    my ($sig, $name) = split(/\s+/, `md5sum $x`);
    if (!defined($same{$sig})){$same{$sig} = []}
    push @{$same{$sig}}, $name;
}
# Show only the signatures shared by more than one file
for my $sig (keys %same){
    if (@{$same{$sig}} > 1){
        print "Files with same MD5 : ".join('-', @{$same{$sig}})."\n";
    }
}
Say you put that in a file same.pl, you call it like:
perl same.pl
Example of use:
$ md5sum F*
c9904273735f3141c1dd61533e02246a F1
c9904273735f3141c1dd61533e02246a F2
c9904273735f3141c1dd61533e02246a F3
d41d8cd98f00b204e9800998ecf8427e F4
$ perl same.pl F1 F2 F3 F4
Files with same MD5 : F1-F2-F3
Below is a possible python version (working with both python2 and python3).
#!python
import hashlib
import sys

def md5sum(filename):
    # Read in binary mode, in small chunks, so large files need not fit in memory
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        buf = f.read(128)
        while buf:
            d.update(buf)
            buf = f.read(128)
    return d.hexdigest()

if __name__ == "__main__":
    # Map each MD5 signature to the list of files that have it
    same = {}
    for name in sys.argv[1:]:
        sig = md5sum(name)
        same.setdefault(sig, []).append(name)
    for k in same:
        if len(same[k]) > 1:
            print("Files with same MD5: {l}".format(l="-".join(same[k])))
Note that if you are comparing a really large number of files, providing the file names on the command line as in the above examples may not be enough: the shell command line will overflow, so you should use some more elaborate way of doing it (or put some glob inside the script).
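For the large-collection case, one option is to let the script walk a directory itself instead of taking names on the command line. The sketch below is only an illustration under assumptions: the names file_md5 and find_duplicates and the 64 KiB chunk size are my own choices, not part of the answer above. Because equal MD5 sums only make files likely identical, it confirms every candidate with a byte-by-byte comparison using filecmp.cmp(..., shallow=False).

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return sorted lists of paths under root whose contents are byte-identical."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_hash[file_md5(path)].append(path)

    groups = []
    for paths in by_hash.values():
        # Split each hash bucket into sets of files that really are identical.
        confirmed = []
        for path in paths:
            for group in confirmed:
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                confirmed.append([path])
        groups.extend(sorted(g) for g in confirmed if len(g) > 1)
    return sorted(groups)
```

Calling find_duplicates(".") would then return one list per set of identical files, which can be printed in the same "F1-F2-F3" style as above.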
I have 2 large files each containing long strings separated by newlines in different formats. I need to find similarities and differences between them. The Problem is that the formats of the two files differ.
File a:
9217:NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE:dasda97sda9sdadfghgg789hfg87ghf8fgh87
File b:
NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE
So now I want to extract the whole line containing NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE from File a to a new file and also delete this line in File a.
I have tried achieving this with meld and got to the point that it will at least show me the similarities only. Say File a has 3000 lines and File b has 120 lines; now I want to find the lines with at least n consecutive identical chars and remove these from File a.
I found this and accordingly tried to use diff like this:
diff --unchanged-line-format='%L' --old-line-format='' \
--new-line-format='' a.txt b.txt
This didn't do anything: I got no output whatsoever, so I guess it exited with 0 and didn't find anything.
How can I make this work? I have Linux and Windows available.
Given the format of the files, the most efficient implementation would be something like this:
1. Load all b strings into a [hashtable] or [HashSet[string]]
2. Filter the contents of a by:
   - Extracting the substring from each line with String.Split(':') or similar
   - Checking whether it exists in the set from step 1
$FilterStrings = [System.Collections.Generic.HashSet[string]]::new(
    [string[]]@(
        Get-Content .\path\to\b
    )
)
Get-Content .\path\to\a |Where-Object {
    # Split the line into the prefix, middle, and suffix;
    # Discard the prefix and suffix
    $null,$searchString,$null = $_.Split(":", 3)
    if($FilterStrings.Contains($searchString)){
        # we found a match, write it to the new file
        $searchString |Add-Content .\path\to\matchedStrings.txt
        # make sure it isn't passed through
        $false
    }
    else {
        # substring wasn't found to be in `b`, let's pass it through
        $true
    }
} |Set-Content .\path\to\filteredStrings.txt
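Since the question mentions that Linux is available too, here is a rough Python sketch of the same set-based filtering. It is an illustration only: split_matches and filter_files are hypothetical names, all paths are placeholders, and it assumes every line of a has the prefix:middle:suffix shape shown in the question. Unlike the PowerShell snippet, it keeps the whole matching line, which is what the question asks for.

```python
def split_matches(a_lines, b_lines):
    """Partition a_lines into (matched, remaining) by their middle field."""
    wanted = {line.strip() for line in b_lines if line.strip()}
    matched, remaining = [], []
    for line in a_lines:
        # "prefix:middle:suffix" -> use only the middle field for the lookup
        parts = line.rstrip("\n").split(":", 2)
        middle = parts[1] if len(parts) > 1 else None
        (matched if middle in wanted else remaining).append(line)
    return matched, remaining


def filter_files(a_path, b_path, matched_path, filtered_path):
    """Hypothetical file-based wrapper; all four paths are placeholders."""
    with open(a_path) as a, open(b_path) as b:
        matched, remaining = split_matches(a.readlines(), b.readlines())
    with open(matched_path, "w") as out:
        out.writelines(matched)
    with open(filtered_path, "w") as out:
        out.writelines(remaining)
```

The set lookup keeps the work roughly proportional to the number of lines in a plus the number of lines in b, instead of comparing every line of a against every line of b.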
I have a folder containing 5 files named respectively 'out1.jpg', 'out2a.jpg', 'out2b.jpg', 'out3.jpg' and 'out4.jpg' in addition to other files in different formats.
I have this Python script which is supposed to print all the filenames that match:
import fnmatch
import os
c = 1
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, 'out'+str(c)+'*.jpg'):
        print(file)
        c +=1
However, when I run this script, the output is limited to the following:
out1.jpg
out2a.jpg
out3.jpg
Does anyone have an idea how to change the script so that it displays all the filenames that match (the 5 filenames that I mentioned)?
You are increasing c on each iteration (well, on each iteration that found a match, but anyway...), so it obviously cannot match "out2a.jpg" AND "out2b.jpg". Assuming you want all file names that match "out" + some number + possibly something else, you can use character ranges instead, i.e.:
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, 'out[0-9]*.jpg'):
        print(file)
NB : you might have to adjust the exact fnmatch pattern according to your needs and what you have in your directory.
You can also use glob.glob instead, which is both simpler and (according to the doc) more efficient:
import glob
for file in glob.glob("out[0-9]*.jpg"):
    print(file)
EDIT :
I totally understand why it does not display out2a.jpg and out2b.jpg together, but I didn't get why out4.jpg is not displayed!
Quite simply because os.listdir() does not necessarily return the filenames in the order you seem to expect (on my Linux station here, "out4.jpg" comes before the other "outXXX.jpg" files). You can inspect what's happening just by adding a couple of prints:
c = 1
for file in os.listdir('.'):
    exp = 'out{}*.jpg'.format(c)
    print("file: {} - c : {} - exp : {}".format(file, c, exp))
    if fnmatch.fnmatch(file, exp):
        print(file)
        c +=1
And the result here:
file: sofnm.py~ - c : 1 - exp : out1*.jpg
file: out4.jpg - c : 1 - exp : out1*.jpg
file: out2b.jpg - c : 1 - exp : out1*.jpg
file: out1.jpg - c : 1 - exp : out1*.jpg
out1.jpg
file: out2a.jpg - c : 2 - exp : out2*.jpg
out2a.jpg
file: sofnm.py - c : 3 - exp : out3*.jpg
file: out42a.jpg - c : 3 - exp : out3*.jpg
file: out3.jpg - c : 3 - exp : out3*.jpg
out3.jpg
As you can see, your assumption that os.listdir() would return the files in a given order (starting with "out1.jpg" and ending with "out4.jpg") was wrong. As a general rule, when your code doesn't behave as you expect, tracing the code execution (and the relevant values) is most often the simplest way to find out why.
You are incrementing c only after a file matches, so it is quite possible that the file name is out2a.jpg while the value of c is still 1, and it will not match. You should either list the files in ascending order, so that out1 comes before out2 in the listdir output, or use a generic numeric match instead of matching one number at a time, as shown below:
import fnmatch
import os
for file in os.listdir('.'):
    #print(file)
    if fnmatch.fnmatch(file, 'out[0-9]*.jpg'):
        print(file)
Running through this for loop, you're checking each file in the directory against a very specific file name (first out1*.jpg, then out2*.jpg) with no guarantee that the order of these files matches. When I tried to run the code locally for example, it first compared out2a.jpg with the pattern out1*.jpg, then out2b.jpg with out2*.jpg, then test.py (the script) with out3*.jpg.
You'd be better off using a module like glob (https://docs.python.org/3/library/glob.html) to search for 'out*.jpg': glob.glob('out[0-9]*.jpg').
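As a small sketch of the glob suggestion, assuming you also want the names in a predictable order: the helper name matching_jpgs and its directory parameter are made up for illustration. glob.glob() gives no ordering guarantee either, so the result is sorted explicitly.

```python
import glob
import os


def matching_jpgs(directory="."):
    """Return names such as out1.jpg or out2a.jpg from directory, sorted."""
    # glob.escape protects special characters in the directory name itself
    pattern = os.path.join(glob.escape(directory), "out[0-9]*.jpg")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))
```

The sort is plain string order, so a name like out10.jpg would come before out2.jpg; a numeric sort key would be needed if that matters.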
For example:
Under the folder, the file list is like:
20110101
20110102
20110103
...
20140101
20140102
...
20171231
How can I start looping those files not from the natural beginning (20110101)
but from a middle one (20140101)?
Well you can get an unsorted list of all the files in the current directory with os.listdir(). So you need to first sort this alphabetically (the default when using the sorted() function), and find the index of that "beginning file" and iterate from there.
So, in code, the above would look something like:
import os
b = '20140101'  # the file to start from
fs = sorted(os.listdir())
for f in fs[fs.index(b):]:
    ...
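One caveat with fs.index(b) is that it raises ValueError when no file has exactly that name. A possible variation, sketched here with a hypothetical helper called names_from, uses bisect on the sorted list so that iteration starts at the first name that is greater than or equal to the requested one.

```python
import bisect


def names_from(names, start):
    """Return the sorted names that compare >= start."""
    ordered = sorted(names)
    # bisect_left finds the first position whose name is not less than start
    return ordered[bisect.bisect_left(ordered, start):]
```

It could be used as for f in names_from(os.listdir('.'), '20140101'): and relies on the date-style names sorting correctly as plain strings, which holds for fixed-width names like the ones in the question.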
I am trying to write a Python program that takes n text files. Each file contains names, one per line, like this:
Steve
Mark
Sarah
The program should print only the names that exist in all of the input files.
I am new to programming, so I don't really know how to implement this idea. I thought of using recursion, but the program seems to run in an infinite loop and I am not sure what the problem is. Is the implementation wrong? If so, do you have a better idea of how to implement it?
import sys
arguments = sys.argv[1:]
files = {}
file = iter(arguments)
for number in range(len(sys.argv[1:])):
    files[number] = open(next(file))
def close_files():
    for num in files:
        files[num].close()
def start_next_file(line,files,orderOfFile):
    print('starting next file')
    if orderOfFile < len(files): # to avoid IndexError
        for line_searched in files[orderOfFile]:
            if line_searched.strip():
                line_searched = line_searched[:-1]
                print('searched line = '+line_searched)
                print('searched compared to = ' + line)
                if line_searched == line:
                    #good now see if that name exists in the other files as well
                    start_next_file(line,files,orderOfFile+1)
    elif orderOfFile >= len(files): # when you finish searching all the files
        print('got ya '+line) #print the name that exists in all the files
        for file in files:
            # to make sure the cursor is at the beginning of the read files
            #so we can loop through them again
            files[file].seek(0)
def start_find_match(files):
    orderOfFile = 0
    for line in files[orderOfFile] :
        # for each name in the file see if it exists in all other files
        if line.strip():
            line = line[:-1]
            print ('starting line = '+line)
            start_next_file(line,files,orderOfFile+1)
start_find_match(files)
close_files()
I'm not sure how to fix your code exactly but here's one conceptual way to think about it.
listdir gets all the files in the directory as a list. We narrow that to only .txt files. Next, open, read, split on newlines, and lower to make a larger list containing names. So, files will be a list of lists. Last, find the intersection across all lists using some set logic.
import os
# Keep only the .txt files of the current directory
folder = [f for f in os.listdir() if f.endswith('.txt')]
files = []
for file in folder:
    with open(file) as f:
        # One list of lower-cased names per file
        files.append([name.lower() for name in f.read().splitlines()])
# Names present in every file
result = set.intersection(*map(set, files))
Example:
#file1.txt
john
smith
mary
sue
pretesh
ashton
olaf
Elsa
#file2.txt
David
Lorenzo
Cassy
Grant
elsa
Felica
Salvador
Candance
Fidel
olaf
Tammi
Pasquale
#file3.txt
Jaleesa
Domenic
Shala
Berry
Pamelia
Kenneth
Georgina
Olaf
Kenton
Milly
Morgan
elsa
Returns:
{'olaf', 'elsa'}
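The question takes the file names as program arguments rather than scanning a folder, so the same set logic can be sketched that way too. This is only an illustration: common_names is a made-up name, and lower-casing the names (as in the example above) is an assumption about what should count as the same name.

```python
def common_names(paths):
    """Return the set of names that appear in every file listed in paths."""
    common = None
    for path in paths:
        with open(path) as f:
            # Ignore blank lines and surrounding whitespace, compare case-insensitively
            names = {line.strip().lower() for line in f if line.strip()}
        common = names if common is None else common & names
    return common if common is not None else set()
```

From a script it could be called as print(common_names(sys.argv[1:])) after importing sys. Each file is read once, so there is no need to rewind file objects or recurse.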
I've needed this a few times, and only now has it occurred to me that maybe Vim could do it for me. I often save many files whose names do not matter (they're temporary anyway).
I have a directory full of files: file001.txt, file002.txt ...(they're not really named "filexxx.txt" - but for the sake of discussion ...). I often save a new one, and name it for example, file434.txt. Now since that's something I do often, I'd like to skip the naming checking part.
Is there a way to write a Vim script that checks for the last filexxx.txt in the directory and saves the current buffer as filexxx+1? How should I go about writing something like that? Has anyone done something like this before?
All advice appreciated.
Put the following in ~/.vim/plugin/nextunused.vim
" nextunused.vim
" find the next unused filename that matches the given pattern
" counting up from 0. The pattern is used by printf(), so use %d for
" an integer and %03d for an integer left padded with zeroes of length 3.
function! GetNextUnused( pattern )
  let i = 0
  while filereadable(printf(a:pattern,i))
    let i += 1
  endwhile
  return printf(a:pattern,i)
endfunction
" edit the next unused filename that matches the given pattern
command! -nargs=1 EditNextUnused :execute ':e ' . GetNextUnused('<args>')
" write the current buffer to the next unused filename that matches the given pattern
command! -nargs=1 WriteNextUnused :execute ':w ' . GetNextUnused('<args>')
" To use, try
" :EditNextUnused temp%d.txt
"
" or
"
" :WriteNextUnused path/to/file%03d.extension
"
So if you're in a directory where temp0000.txt through temp0100.txt are all already used
and you do :WriteNextUnused temp%04d.txt, it will write the current buffer to temp0101.txt.
How about a script that you can shell out to? Here is a quick python script that should accomplish what you need. Save the script as "highest.py" to somewhere in your path. From VIM, get into command mode and type
:!python highest.py "file*.txt"
It returns the highest numbered file in the current directory, or a message that no files matched. It handles leading 0's and could be generalized for more complex patterns if need be.
#!/usr/bin/python
#
# Finds the highest numbered file in a directory that matches a given pattern
# Patterns are specified with a *, where the * will be where the number will occur.
#
import os
import re
import sys

highest = ""
highestGroup = -1
if len(sys.argv) != 2:
    print("Usage: python highest.py \"pattern*.txt\"")
    sys.exit(1)
# Escape the literal parts of the pattern; the * becomes a group of digits
pattern = r'(\d+)'.join(re.escape(part) for part in sys.argv[1].split('*'))
exp = re.compile(pattern + '$')
for fname in os.listdir("."):
    matched = exp.match(fname)
    if matched:
        number = int(matched.group(1))
        if number > highestGroup:
            highest = fname
            highestGroup = number
if highest == "":
    print("No files match the pattern: " + sys.argv[1])
else:
    print(highest)
You can write scripts for vim in many powerful languages (depending on how your vim is compiled), such as perl, python, ruby. If it's possible for you to use a vim that's compiled with the appropriate interpreter for one of these languages, this would probably be the easiest way for you to write the script you desire.
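Whichever way the script ends up being called, the core of "last filexxx.txt plus one" is small. Below is a minimal Python sketch of that step only; next_numbered_name and its prefix, suffix and width parameters are hypothetical names chosen for illustration, and how the result is handed back to Vim is left open.

```python
import re


def next_numbered_name(names, prefix="file", suffix=".txt", width=3):
    """Return prefix + (highest existing number + 1), zero-padded, + suffix."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)" + re.escape(suffix))
    numbers = [int(m.group(1)) for m in map(pattern.fullmatch, names) if m]
    # With no matching names, numbering starts at 1 (an arbitrary choice here)
    return prefix + str(max(numbers, default=0) + 1).zfill(width) + suffix
```

The names argument would typically be os.listdir('.'). Unlike GetNextUnused above, which returns the first free number counting up from 0, this continues after the highest number in use, so gaps in the sequence are not filled.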